Brief Reports 


The Journal of Consulting Psychology will 
accept Brief Reports of research studies in 
clinical psychology for early publication with- 
out expense to the author. The procedure is 
intended to permit the publication of soundly 
designed studies of specialized interest or lim- 
ited importance which cannot now be ac- 
cepted because of lack of space. Several pages 
in each issue will be devoted to Brief Reports, 
published in the order of their receipt with- 
out respect to the dates of receipt of the regu- 
lar articles. Most Brief Reports appear in the 
first or second issue to go to press following 
their final acceptance. 


An author who wishes to submit a Brief
Report:

1. Sends the Brief Report, limited to one printed
page and prepared according to the specifications
given below.

2. Also sends to the Editor a full report of the
research study, in sufficient detail to give a clear
account of its background, procedure, results, and
conclusions, which will be filed with the American
Documentation Institute to insure indefinite
availability.

3. Prepares at least 100 mimeographed copies of
the full report, which the author will send without
charge to all who request it as long as the supply
lasts.

4. Agrees not to submit the full report to another
journal of general circulation.


Specifications 


Brief Report. The Brief Report should give 
a clear, condensed summary of the procedure 
of the study and as full an account of the re- 
sults as space permits. 

To insure that the Brief Report will be no
longer than one printed page, its typescript,
including all matter except the title and the
author’s lines, must not exceed 75 lines
averaging 42 characters and spaces in length.
Set the typewriter margins for short lines of
42 characters, which are 3.5 inches long in
elite typing, and 4.2 inches long in pica.
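
The length rule above can be checked mechanically. The following is a minimal sketch, not part of the Journal's procedure: the function names are hypothetical, the pitch values (12 characters per inch for elite, 10 for pica) are the standard typewriter pitches, and skipping blank lines is an assumption made so that double spacing does not count against the quota.

```python
ELITE_CPI = 12  # elite type: 12 characters per inch
PICA_CPI = 10   # pica type: 10 characters per inch


def line_length_inches(chars, chars_per_inch):
    """Physical length of a typed line of `chars` characters."""
    return chars / chars_per_inch


def check_brief_report(lines, max_lines=75, max_avg_width=42):
    """Check a typescript body against the 75-line, 42-character limit.

    Blank lines (for example those produced by double spacing) are
    skipped; that convention is an assumption, not part of the notice.
    Returns (within_limit, line_count, average_width).
    """
    body = [line.rstrip("\n") for line in lines if line.strip()]
    count = len(body)
    average = sum(len(line) for line in body) / count if count else 0.0
    within_limit = count <= max_lines and average <= max_avg_width
    return within_limit, count, average


if __name__ == "__main__":
    # A 42-character line is 3.5 inches in elite and 4.2 inches in pica.
    print(line_length_inches(42, ELITE_CPI))
    print(line_length_inches(42, PICA_CPI))
    print(check_brief_report(["x" * 42, "", "y" * 40]))
```

Title and author lines would be left out of `lines` before the check, since the notice excludes them from the count.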

The manuscript of the Brief Report must 
be double spaced throughout. Except for its 
short lines, it follows the standard style (1). 
Headings, tables, and references are avoided 
or, if essential, must be counted in the 75 
lines. Each Brief Report must be accom- 
panied by a footnote in the style below, 
which is typed on a separate sheet and not 
counted in the 75-line quota: 1

1 An extended report of this study may be obtained
without charge from John Doe, 300 Market St.,
Prospect 6, Mass. (giving the author’s full name
and address), or for a fee from the American
Documentation Institute. Order Document No. ——,
remitting $—— for microfilm or $—— for photocopies.


Extended report. Because the extended re- 
port is intended for photoduplication, and is 
not copy to be sent to a printer, its style 
should differ in several ways from that of 
other manuscripts: (a) The extended report 
should be typed with single spacing for 
economy in duplication. (b) Tables and figures
should be placed adjacent to the text
which refers to them. A caption should be 
typed below each figure. (c) Footnotes should 
be typed at the bottom of the page on which 
reference is made to them. In other respects, 
the full report is prepared in the style speci- 
fied by the Publication Manual (1). 


Reference 


1. American Psychological Association, Council of
Editors. Publication manual of the American
Psychological Association (1957 rev.). Washington,
D. C.: American Psychological Association, 1957.





Erratum


Due to a printer’s error, the author’s name was stated incorrectly on 
page 131 of the April, 1957, issue. The author of the article “Reliability
(Internal Consistency) of the Wechsler Memory Scale and 
Correlation with the Wechsler-Bellevue Intelligence Scale” is: 


Julia C. Hall 


Veterans Administration Hospital, Bronx, New York 


It will be appreciated if subscribers will indicate the correction of the 
name on page 131. 

The Results of Psychotherapy with Children: 
An Evaluation 


Eugene E. Levitt 
A compendium of results of psychotherapy 
with adults was published a few years ago by 
Eysenck (16). It included reports from 24 
sources on more than 8,000 cases treated by an 
assortment of psychotherapeutic techniques. 
The average percentage of cases reported as im- 
proved (i.e., cured, improved, much improved, 
adjusted, well, etc.) is about 65. Eysenck’s 
control or baseline data estimating the remis- 
sion rate in the absence of formal psycho- 
therapy come from two sources. Those of 
Landis (32) for hospitalized neurotics, and 
those of Denker (14) for neurotics treated at 
home by general practitioners, show similar 
remission rates of about 70% for a 2-year pe- 
riod. Comparing these figures with the aver- 
age for the treated cases, Eysenck concluded, 
“. . . roughly two-thirds of a group of neu- 
rotic patients will recover or improve to a 
marked extent within about two years of the 
onset of their illness, whether they are treated 
by means of psychotherapy or not” (16, p. 
322). He concludes further that “the figures 
fail to support the hypothesis that psycho- 
therapy facilitates recovery from neurotic dis- 
order” (16, p. 323). 

The difficulties attending an evaluation of
psychotherapy have been detailed many times,
most recently by Rosenzweig (47) in a
critique of Eysenck’s findings. Other thoughtful
and well-organized delineations of evaluation
problems include those of Thorne (50),
Zubin (56, 57), and Greenhill (22), among
others. It is not within the province of the
present paper to repeat these accounts.


1 The data, however, are not quite as “remarkably
stable from one investigation to another” as Eysenck
appears to believe. The 19 reports of the results of
eclectic therapy differ significantly among themselves
when frequencies of improvement and nonimprovement
are compared. A chi square is 38.11 with a
p beyond the .01 level for 18 degrees of freedom.
Eysenck’s point is nonetheless basically reasonable;
the range of per cent improvement of from 41 to 77
represents considerable stability when one considers
the differences in population, chronology, treatment,
classification, and terminology among the studies.


The purpose of this paper is to summarize
available reports of the results of psychotherapy
with children using Eysenck’s article (16)
as a model.2 Certain departures will be necessitated
by the nature of the data, but in the
main, the form will follow that of Eysenck.


Baseline and Unit of Measurement 


As in Eysenck’s study, the “unit of meas- 
urement” used here will be evaluations of the 
degree of improvement of the patient by con- 
cerned clinicians. Individuals listed as “much 
improved, improved, partially improved, suc- 
cessful, partially successful, adjusted, partially 
adjusted, satisfactory,” etc., will be grouped 
under the general heading of Improved. The 
Unimproved cases were found in groupings 
like “slightly improved, unimproved, unad- 
justed, failure, worse,” etc. 

The use of the discharge rate of children’s
wards in state hospitals as a baseline for
evaluating the effects of psychotherapy is not
recommended. It is most likely that hospitalized
children are initially more disturbed than
those brought to the child guidance clinics
and family service agencies from which the
data on treatment are drawn. Few guidance
clinics or family service agencies accept psychotic
children for treatment, tending instead
to refer them to the state hospital. Furthermore,
as Rosenzweig (47) points out, the criteria
for discharge from a state hospital are
probably less stringent than those leading to
an appraisal of Improved by other agencies.
For these reasons, available statistics of state
hospital populations such as those of Witmer
(52), McFie (38), and Robins and O’Neal
(46) are not used as baseline data.


2 Compendia similar to, and overlapping, Eysenck’s
have been published by Zubin (57) and by Miles,
Barrabee, and Finesinger (39). These tend to be
more detailed and descriptive. Eysenck’s work is
most concise; in it, descriptions and discussions of
individual studies have been subordinated to the
presentation of overall results. The present writer
feels that this is the most provocative, and hence
most fruitful, way of evaluating a collection of
psychotherapeutic results.


Follow-up evaluations of changes in behav- 
ior problems in normal children also do not 
furnish satisfactory control data. Studies such 
as those of McFie (38) and Cummings (12) 
report markedly conflicting results, probably 
as a function of differences in ages of the sub- 
jects, and of varying follow-up intervals. More 
importantly, behavior like nail biting and nose 
picking can hardly be regarded as comparable 
to the problems for which children are re- 
ferred to guidance clinics. 

The use of a follow-up control group of 
cases closed as unsuccessful, as in the study 
of Shirley, Baum, and Polsky (49), suffers 
from obvious weaknesses. Such a group is not 
comparable to an untreated sample; it ap- 
pears to represent the segment of the treat- 
ment population for which a poor prognosis 
has been already established. 

A common phenomenon of the child guid- 
ance clinic is the patient who is accepted for 
treatment, but who voluntarily breaks off the 
clinic relationship without ever being treated. 
In institutions where the service load is heavy 
and the waiting period between acceptance 
and onset of treatment may range up to 6 
months, this group of patients is often quite 
large. Theoretically, they have the charac- 
teristics of an adequate control group. So far 
as is known, they are similar to treated groups 
in every respect except for the factor of treat- 
ment itself. 

Nevertheless, the use of this type of group 
as a control is not common in follow-up evalu- 
ations of the efficacy of treatment. Three stud- 
ies report follow-up data on such groups. Of 
these, the data of Morris and Soroker (40) 
are not suitable for the purposes of this paper. 
Of their 72 cases, at least 11 had treatment 
elsewhere between the last formal contact with 
the clinic and the point of evaluation, while 
an indeterminate number had problems too 
minor to warrant clinic treatment. 

The samples in the remaining two studies 
appear satisfactory as sources of baseline data. 
Witmer and Keller (55) appraised their group 
8 to 13 years after clinic treatment, and re- 
ported that 78% were Improved. In the Lehr- 
man study (34), a one-year follow-up interval 
found 70% Improved. The overall rate of im- 
provement for 160 cases in both reports is 
72.5%. This figure will be used as the base- 
line for evaluating the results of treatment of 
children. 


The Results of Psychotherapy 


Studies showing outcome at close of treat- 
ment are not distinguished from follow-up 
studies in Eysenck’s aggregation. The distinc- 
tion seems logical, and is also meaningful in 
the predictive sense, as the analyses of this 
paper will indicate. Of the reports providing 
data for the present evaluation, thirteen pre- 
sent data at close, twelve give follow-up re- 
sults, and five furnish both types, making a 
total of eighteen evaluations at close and 
seventeen at follow-up. The data of two re- 
ports (29, 30) are based on a combined close- 
follow-up rating. Results for the three kinds 
of evaluations will be presented separately. 

The age range covered by all studies is from 
preschool to 21 years at the time of original 
clinic contact, the customary juncture for the 
determination of age for the descriptive data. 
However, very few patients were over 18 years 
at that time, and not many were over 17. 
The median age, roughly estimated from the 
ranges, would be about 10 years. 

The usual psychiatric classification of men- 
tal illnesses is not always appropriate for 
childhood disorders. The writer has attempted 
to include only cases which would crudely be 
termed neuroses, by eliminating the data on 
delinquents, mental defectives, and psychotics 
whenever possible. The latter two groups con- 
stituted a very small proportion of the clinic 
cases. The proportion of delinquent cases is 
also small at some clinics but fairly large at 
others. Since the data as presented were not
always amenable to these excisions, an unknown
number of delinquent cases are included.
However, the outcomes for the separated
delinquents are much the same as those
for the entire included group.

As in Eysenck’s study, a number of reports 
were excluded here for various reasons. The 
investigations of Healy and Bronner (24), 
Feiker (18), Ellis (15), Mann (37), and 
Giddings (20) were eliminated because of 
overlap, partial overlap, or suspected overlap 
of the sample with samples of included re- 
ports. Those of Bennett and Rogers (3), Rich 
(45), Hunt, Blenkner, and Kogan (27), 
Schiffmann and Olson (48), and Heckman 
and Stone (25) were not useable either be- 
cause of peculiar or inadequate presentation 
of data, or because results for children and 
adults were inseparable. 

The number of categories in which patients 
were classified varied from study to study. 
Most used either a three-, four- or five-point 
scale. A few used only two categories, while 
one had twelve. Classification systems with 
more than five points were compressed into 
smaller scales. The data are presented tabu- 
larly in their original form, but the totals are 
pooled into three categories, Much Improved, 
Partially Improved, and Unimproved. A summation
of the former two categories gives the 
frequency of Improved Cases. 

A summary of results at close is shown in 
Table 1. Results of follow-up evaluations are 
summarized in Table 2, while the results from 
two studies using a combined close-follow-up 
evaluation are presented in Table 3. In the 
latter two tables, the follow-up interval is 
given as a range of years, the usual form of 
presentation in the studies. An attempt has 
been made to compute an average interval 
per case, using the midpoint of the range as 
a median when necessary. These averages are 
tenuous since it cannot be safely assumed 
that the midpoint actually is the median 
value. For example, in the Healy-Bronner investigation
(23), the range of intervals is 1
to 20 years, but the median is given as 2½
years. Since the proportion of cases which can 
be located is likely to vary inversely with the 
number of years of last clinic contact, the 
averages of 4.8 years for the follow-up studies 
and 2.3 years for the close-follow-up studies 
are probably overestimates. 

Table 1


Summary of Results of Psychotherapy with Children at Close


                       Much      Partially                  Per cent
Study          N     Improved    Improved     Unimproved    improved
(11)          57        16        18  12         8  3         80.7
(26)         100        13        18  42        26  1         73.0
(28)          70        12        29  19          10          85.7
(44)         250        54        82  46          68          72.8
(34)         196        76          52            68          65.3
(31)          50        15          18            17          66.0
(10)         126        25          54            47          62.7
(53)         290        75         154            61          79.0
(2)          814       207         398           209          74.3
(43)          72        26          31            15          79.2
(33)         196        93          61            42          78.6
(6)           27         5          11            11          59.3
(9)           31        13           8            10          67.7
(8)           23         2           9            12          47.8
(7)           75        35          22            18          76.0
(1)           80        31          21            28          65.0
(35)         522       225                       297          43.1
(13)         420       251                       169          59.8
All cases  3,399     1,174       1,105         1,120          67.05
Per cent   100.00    34.54       32.51         32.95


Table 2


Summary of Results of Psychotherapy with Children at Follow-up


            Interval             Much      Partially                  Per cent
Study       in years     N     Improved    Improved     Unimproved    improved
(33)          1-5       197       49        55  39        38  16        72.6
(5)           2          33        8        11   7         6   1        78.8
(11)          2-3        57       25        17   6         6   3        84.2
(52)a         1-10      366       81        78 106          101         72.4
(28)          2-3        70       21        30  13            6         91.4
(51)          5-8        17        7           3           4   3        58.8
(34)          1         196       99          46             51         74.0
(41)         16-27       34       22          11              1         97.1
(2)           1-20      705      358         225            122         82.7
(4)           5-18      650      355         181            114         82.5
(36)          3-15      484      111         264            109         77.5
(19)          1-4       732      179         398            155         78.8
(13)          5         359      228          80             51         85.8
(21)          1-2        25        6          12              7         72.0
(42)          1-2        25       10           6              9         64.0
(35)          ½-1½      191       82                        109         42.9
(23)          1-20       78       71                          7         91.0
All cases     4.8b    4,219    1,712       1,588            919         78.22
Per cent              100.00   40.58       37.64          21.78

a Data based on 13 studies originally reported in (54); results of 8 of these are included here.

b Estimated average follow-up interval per case.


Table 1 shows that the average percentage
of improvement, i.e., the combined percentages
in the Much Improved and Partially Improved
categories, is 67.05 at close. It is not
quite accurate to say that the data are consistent
from study to study. A chi-square
analysis of improvement and unimprovement
yields a value of 230.37, which is significant
beyond the .001 level for 17 df. However, as
in the case of Eysenck’s data, there is a considerable
amount of consistency considering
the interstudy differences in methodology,
definition, etc.
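
The consistency test can be retraced from the counts in Table 1. The sketch below is an illustration, not the author's own computation: it assumes a Pearson chi-square on an 18 x 2 table (improved versus unimproved per study), and it uses improved counts obtained by adding the Much Improved and Partially Improved columns, with the few cells that are illegible in the printed table inferred from each study's N and per cent improved. The function name is hypothetical.

```python
# (N, improved) per study, in the order of Table 1.
# Improved = Much Improved + Partially Improved; cells illegible in the
# printed table are inferred from N and the per cent improved column.
TABLE_1 = [
    (57, 46), (100, 73), (70, 60), (250, 182), (196, 128), (50, 33),
    (126, 79), (290, 229), (814, 605), (72, 57), (196, 154), (27, 16),
    (31, 21), (23, 11), (75, 57), (80, 52), (522, 225), (420, 251),
]


def chi_square_homogeneity(rows):
    """Pearson chi-square for a k x 2 table of improved vs. unimproved.

    Expected frequencies come from the pooled improvement rate.
    Returns (chi_square, degrees_of_freedom).
    """
    total_n = sum(n for n, _ in rows)
    total_improved = sum(improved for _, improved in rows)
    pooled_rate = total_improved / total_n
    chi_square = 0.0
    for n, improved in rows:
        cells = ((improved, pooled_rate), (n - improved, 1.0 - pooled_rate))
        for observed, rate in cells:
            expected = n * rate
            chi_square += (observed - expected) ** 2 / expected
    return chi_square, len(rows) - 1


if __name__ == "__main__":
    value, df = chi_square_homogeneity(TABLE_1)
    print(f"chi-square = {value:.2f} for {df} df")
```

Under these assumptions the statistic should land close to the reported 230.37 for 17 df; most of it comes from the two studies that used a two-point scale, (35) and (13).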

The average percentage of improvement in 
the follow-up studies is given in Table 2 as
78.22. The percentage for the combined
close-follow-up evaluations is 73.98, roughly
between the other two. The percentage of
improvement in the control studies was 72.5, 
slightly higher than the improvement at close 
and slightly lower than at follow-up. It would 
appear that treated children are no better off 
at close than untreated children, but that 
they continue to improve over the years and 
eventually surpass the untreated group. 

This conclusion is probably specious, perhaps
unfortunately. One of the two control
studies was an evaluation one year after the
last clinic contact, the other 8 to 13 years
after. The former study reports only 70%
improvement while the longer interval provided
78% improvement. The figure for the
one-year interval is similar to the results at
close, while the percentage of improvement
for the control with the 8- to 13-year interval
is almost identical with that for the follow-up
studies.


Table 3


Summary of Results of Psychotherapy with Children Based on Combined Close-Follow-up Evaluation


            Interval             Much      Partially                  Per cent
Study       in years     N     Improved    Improved     Unimproved    improved
(29)          1-10      339       94        81  76        42  46        74.04
(30)          1-10       30        9          13              8         73.33
All cases     5.5a      369      103         170             96         73.98
Per cent              100.00   27.91       46.07          26.02

a Estimated average follow-up interval per case.


The point of the analysis is more easily 
seen if the results at close and at follow-up 
are pooled. This combination gives the same 
sort of estimate as that furnished by the two 
control groups pooled since one of them is a 
long-interval follow-up while the other was 
examined only a short time after clinic con- 
tact. The pooled percentage of improvement 
based on 7,987 cases in both close and fol- 
low-up studies is 73.27, which is practically 
the same as the percentage of 72.5 for the 
controls. 
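
The pooled figure can be reproduced from the table totals. The sketch below is illustrative only; it assumes that the 7,987 cases are the sum of the "All cases" rows of Tables 1, 2, and 3, which is the only combination of the printed totals that gives that N. The dictionary keys and function name are hypothetical.

```python
# (N, improved) from the "All cases" rows; improved is the sum of the
# Much Improved and Partially Improved totals.
TOTALS = {
    "close (Table 1)": (3399, 2279),
    "follow-up (Table 2)": (4219, 3300),
    "combined close-follow-up (Table 3)": (369, 273),
}


def pooled_percentage(groups):
    """Return (total N, total improved, per cent improved) across groups."""
    total_n = sum(n for n, _ in groups.values())
    total_improved = sum(improved for _, improved in groups.values())
    return total_n, total_improved, 100.0 * total_improved / total_n


if __name__ == "__main__":
    n, improved, per_cent = pooled_percentage(TOTALS)
    print(n, improved, round(per_cent, 2))
```

On these totals the pooled rate is 5,852 of 7,987, or 73.27 per cent, against 72.5 per cent for the controls.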

It now appears that Eysenck’s conclusion 
concerning the data for adult psychotherapy 
is applicable to children as well; the results 
do not support the hypothesis that recovery 
from neurotic disorder is facilitated by psy- 
chotherapy. 

The discrepancy between results at close 
and at follow-up suggests that time is a fac- 
tor in improvement. Denker’s report (14) 
also indicated the operation of a time factor. 
He found that 45% of the patients had re- 
covered by the end of one year, 72% had re- 
covered by the end of two years, 82% by 
three years, 87% by four years, and 91% by 
five years. The rate of improvement as a 
function of time in Denker’s data is clearly 
negatively accelerating. 
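
The sense in which Denker's curve is negatively accelerating can be made explicit by taking year-to-year differences of the cumulative percentages quoted above. This is a small illustrative sketch; the function names are hypothetical.

```python
# Cumulative per cent recovered by the end of years 1 to 5 (Denker, 14).
CUMULATIVE = [45, 72, 82, 87, 91]


def yearly_gains(cumulative):
    """Per cent newly recovered in each successive year."""
    gains = [cumulative[0]]
    for previous, current in zip(cumulative, cumulative[1:]):
        gains.append(current - previous)
    return gains


def is_negatively_accelerating(cumulative):
    """True when every yearly gain is smaller than the one before it."""
    gains = yearly_gains(cumulative)
    return all(later < earlier for earlier, later in zip(gains, gains[1:]))


if __name__ == "__main__":
    print(yearly_gains(CUMULATIVE))
    print(is_negatively_accelerating(CUMULATIVE))
```

The yearly gains are 45, 27, 10, 5, and 4 percentage points, so the curve keeps rising while each increment shrinks.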

A Spearman rank-order correlation between 
estimated median follow-up interval and per- 
centage of improvement in the 17 studies in 
Table 2 is .48, p = .05. This estimate of re- 
lationship should be viewed with caution be- 
cause of the aforementioned difficulty in de- 
termining median intervals. However, it is 
uncorrected for tied ranks, which tends to 
make it a conservative null test. It is also, of 
course, insensitive to the curve of the bivariate 
distribution. 

The percentage of improvement as a function
of time interval is shown by the data of
Table 4. The studies have been grouped at
five time-interval points in the table. There
are four studies with estimated median intervals
of 1-1½ years, six with intervals of 2-2½
years, three with 5-6½ years, two with 10
years, and two with 12 years.

The data of Table 4 indicate that most
of the correlation between improvement and
time-interval is accounted for by the studies
with the shortest intervals, and those with the
largest. The curve is more or less the same as
that of Denker’s data, negatively accelerating
with most of the improvement accomplished
by 2½ years. It is peculiar that the improvement
after 1½ years is about 60%, less than
the 67% improvement at close. However, the
difference is not too great to attribute to
variations in methodology and sampling among
the concerned studies. Another potential
explanation will be offered shortly.
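
The grouped figures in Table 4 can be recomputed from the counts, and a rank correlation can be taken across the five interval groups. The sketch below is illustrative and rests on stated assumptions: the five groups are simply numbered 1 to 5 in order of interval, and the Spearman coefficient is computed on these five groups, not on the 17 individual studies, so it is not comparable with the .48 reported for the studies themselves. All names are hypothetical.

```python
# (total N, N improved) for the five interval groups of Table 4,
# ordered from the shortest to the longest estimated median interval.
TABLE_4 = [(437, 261), (1167, 929), (742, 583), (1189, 958), (684, 569)]


def per_cent_improved(groups):
    """Per cent improved for each (N, improved) pair."""
    return [100.0 * improved / n for n, improved in groups]


def ranks(values):
    """Ranks from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation without a correction for ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    rates = per_cent_improved(TABLE_4)
    print([round(r, 1) for r in rates])
    print(spearman_rho([1, 2, 3, 4, 5], rates))
```

Only the second and third groups are out of order, which is why the group-level coefficient is high; the larger jump between the first two groups is what the text attributes to the shortest-interval studies.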

This analysis suggests that improvement is 
in part a function of time, though the mecha- 
nisms involved remain purely speculative. Fu- 
ture comparisons of the results of psycho- 
therapy should properly take this factor into 
consideration. 

Inspection of the data in Table 1 discloses
another potential factor in the improvement
rate. The studies in which only two rating
categories, improved and unimproved, have
been used, appear to furnish lower percentages
of improvement than the average. In the
two reports of this kind in Table 1, the average
improvement is only 50.5% compared
with the overall 67%. A complete analysis of
percentage of improvement as a function of
number of categories is shown in Table 5.






Table 4

Improvement as a Function of the Interval Between
Last Clinic Contact and Follow-up

Estimated
median       Number
interval       of       Total       N       Per cent
in years    reports       N      improved   improved
1-1½           4         437       261       59.73
2-2½           6       1,167       929       79.61
5-6½           3         742       583       78.57
10             2       1,189       958       80.57
12             2         684       569       83.19
All cases     17       4,219     3,300       78.22








Table 5


Improvement as a Function of the Number of Points
on the Rating Scale in Evaluation at Close


Number       Number
  of           of       Total       N       Per cent
points      reports       N      improved   improved
2              2         942       476       50.53
3             12       1,980     1,442       72.83
4              2         320       242       75.63
5              2         157       119       75.80
All cases     18       3,399     2,279       67.05


Examination of Table 5 indicates that three-, 
four- and five-point rating scales produce 
about the same percentage of improvement. 
The use of a two-point scale, however, results
in over 20% less improvement than the
others.3 This kind of analysis cannot be applied
to the data in Table 2 since it will be 
confounded by the time factor. 

Evidently, a certain proportion of the un- 
improved cases in the studies using two cate- 
gories would have fallen in partially improved 
categories if they had been utilized. A number 
of cases in which a fair amount of improve- 
ment was manifested are forced into the un- 
improved category when central points are not 
available. A two-point scale thus seems to be 
overly coarse. It is desirable that finer scales 
be used in future evaluation studies. 
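
The size of the two-point effect can be read off Table 5 directly. The sketch below recomputes the per cent improved for each scale and pools the three-, four-, and five-point studies for comparison; the pooling step and the names used are illustrative choices, not part of the original analysis.

```python
# Number of scale points -> (total N, N improved), from Table 5.
TABLE_5 = {2: (942, 476), 3: (1980, 1442), 4: (320, 242), 5: (157, 119)}


def per_cent(n, improved):
    """Per cent improved."""
    return 100.0 * improved / n


def pooled_finer_scales(table):
    """Pooled (N, improved) for all scales with more than two points."""
    n = sum(total for points, (total, _) in table.items() if points > 2)
    improved = sum(imp for points, (_, imp) in table.items() if points > 2)
    return n, improved


if __name__ == "__main__":
    for points, (n, improved) in sorted(TABLE_5.items()):
        print(points, f"{per_cent(n, improved):.2f}")
    finer = per_cent(*pooled_finer_scales(TABLE_5))
    two_point = per_cent(*TABLE_5[2])
    print(f"gap: {finer - two_point:.1f} percentage points")
```

The finer scales pool to about 73.4 per cent improved against 50.5 per cent for the two-point scale, a gap of roughly 23 percentage points.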

The study of Maas et al. (35), which furnishes
three-quarters of the cases in the 1-1½
year interval group in Table 4, used a
two-point scale. The percentage of improvement 
is only 43, which may account for the fact 
that this time-interval group has a lower per- 
centage of improvement than in the studies 
at close. 

There are a number of different kinds of
therapies which have been used in the studies
reported here. The therapists have been
psychiatrists, social workers, and teams of
clinicians operating at different points in the
patient’s milieu. Therapeutic approaches included
counseling, guidance, placement, and
recommendations to schools and parents, as
well as deeper level therapies. In some instances
the patient alone was the focus of attention.
In others, parents and siblings were
also treated. The studies apparently encompassed
a variety of theoretical viewpoints, although
these are not usually specified. Viewed
as a body, the studies providing the data for
Tables 1, 2, and 3 are therapeutically eclectic,
a plurality, perhaps, reflecting psychoanalytic
approaches.


3 The marked difference between the two-point
scale studies and those using finer scales is reflected
in the consistency analysis. The chi square for 17 df
was 230.37, but when the two-category studies are
eliminated, it falls to 52.66 for 15 df. The value is
significant beyond the .01 level, but the original chi
square has been decreased by more than 75% with a
loss of only two df.


Thus we may say that the therapeutic eclec- 
ticism, the number of subjects, the results, 
and the conclusions of this paper are mark- 
edly similar to those of Eysenck’s study. Two- 
thirds of the patients examined at close and 
about three-quarters seen in follow-up have 
improved. Approximately the same percent- 
ages of improvement are found for compa- 
rable groups of untreated children. 

As Eysenck pointed out (17) in a sequel to 
his evaluation, such appraisal does not prove 
that psychotherapy is futile. The present 
evaluation of child psychotherapy, like its
adult counterpart, fails to support the hypothesis
that treatment is effective, but it
does not force the acceptance of a contrary
hypothesis. The distinction is an important
one, especially in view of the differences
among the concerned studies, and their generally
poor caliber of methodology and analysis.
Until additional evidence from well-planned
investigations becomes available, a 
cautious, tongue-in-cheek attitude toward 
child psychotherapy is recommended. 


Summary 


A survey of eighteen reports of evaluations 
at close, and seventeen at follow-up, was com- 
pared with similar evaluations of untreated 
children. Two-thirds of the evaluations at 
close, and three-quarters at follow-up, showed 
improvement. Roughly the same percentages 
were found for the respective control groups. 
A crude analysis indicates that time is a fac- 
tor in improvement in the follow-up studies; 
the rate of improvement with time is nega- 
tively accelerating. Further analysis contra- 
indicates the use of only two categories in 
evaluation. This scale tends to give much 
lower rates of improvement than three-, four-, 
and five-point scales. 

It is concluded that the results of the present
study fail to support the view that psychotherapy
with “neurotic” children is effective.


Received August 20, 1956. 
The acquisition by the child of normal sex-role
behavior is a fundamental aspect of total
personality development and adjustment. 
There are two major reasons why a better un- 
derstanding is needed of the process by which 
a little girl comes to adopt the feminine role 
and learns how to be a “woman” and a little 
boy comes to adopt the masculine role and 
learns how to be a “man.” One reason is theo- 
retical and the other is practical. While there 
is an abundance of speculation, based largely 
on adults, concerning the nature and dynam- 
ics of sex-role adjustment, reviews of the lit- 
erature on sex differences (1, 11) show an 
almost complete absence of studies that spe- 
cifically deal with the problem of sex-role de- 
velopment in children. The practical need for 
data in this area comes from the increasing 
recognition by workers in clinical psychology 
and psychiatry that difficulties or distortions 
in sex-role adjustment appear to be function- 
ally related to the occurrence of personality 
maladjustments and certain forms of emo- 
tional disorders. This suggests a direct link 
between childhood learning and development 
in sex-role behavior and adult personality
disturbances. 

In a previous investigation (2) the author
reported on the development of a masculinity-femininity
scale for use with children, called
the It Scale for Children (ITS).2 Results obtained
from the use of this scale with 78 male
and 68 female kindergarten subjects were presented
and discussed in terms of the identification
process, sex differences in the acceptance
of appropriate sex roles, conflicts in sex
roles, and homosexuality in relation to sex-role
development.


1 This article is based on a paper read at the annual
meeting of the American Psychological Association,
Chicago, 1956.

2 Published by Psychological Test Specialists, Box
11, Grand Forks, North Dakota.


The Present Investigation 


The present paper represents an extension 
of the original research on the development 
of masculinity-femininity patterns in chil- 
dren (2). Whereas the initial investigation 
was concerned with a single age group from 
about 5½ to 6½ years, the present study involves
a considerably larger age range, thus 
making possible the exploration of the factor 
of age in relation to sex-role adjustment. 


The Problem 


The concern of the present study is to pro- 
vide an analysis of the projected preferences 
of male and female children for various as- 
pects of the masculine and feminine roles. 
The concept, sex-role preference, refers to the 
degree that one or the other sex role is pre- 
ferred by the child and may be operationally 
defined on the basis of preferential responses 
of children to objects and figures that are 
typical of one sex in contrast to the other sex. 


Subjects 


Six hundred and thirteen children, 303 boys
and 310 girls, between the ages of approximately
5½ and 11½ were used as Ss and were
tested in the spring of 1955.3 These children
constituted all pupils enrolled in classes from
kindergarten through the fifth grade in the
Pleasanton, California, Elementary School.4
The mean IQs of classes, second through fifth
grade, based on the California Short-Form
Test of Mental Maturity, ranged from 103 to
106. The socioeconomic status of the families
of most of the Ss ranged from lower-middle
to upper-middle. Approximately 33% were
from families in which the father was on active
duty at nearby Parks Air Force Base,
and another 10% were from families in which
the father held a position with the Federal or
state government (Military, VA Hospital, or
Atomic Energy Laboratory).


3 Acknowledgment is made to Mr. W. V. Speaks,
formerly A1/C, U.S.A.F., Parks Air Force Base,
California, who administered the It Scale for Children
to each of the 613 subjects.

4 Acknowledgment is made to Mr. John C. Mann,
Assistant Superintendent, to Mr. Thomas S. Hart,
Principal, and to the teachers of the Pleasanton
Elementary School, California, for their cooperation and
assistance that made possible the present study.


Procedure


The It Scale for Children was given on an
individual basis to each S. This scale is made
up of 36 picture cards, 3 by 4, of objects and
figures socially defined and identified with the
masculine or feminine roles in our culture.
The projective element in the It scale is a
child-figure referred to as “It,” which is used
to facilitate the child’s expression of his or
her own role preference. The It-figure was intentionally
drawn so that it would be ambiguous
and relatively unstructured as to
sexual identity. Each S, rather than being
asked to choose directly, is asked to make
choices for It. There are four major sections
that comprise the scale:


Toy Pictures Section, made up of sixteen pictures, 
eight male objects (e.g., tractor and rifle) and eight 
female objects (e.g., doll and dishes) to which the 
child responds by having It make a total of eight 
choices. Each choice of a male item is scored one 
point, each choice of a female item is scored zero. 

Eight Paired Pictures Section, made up of eight 
pairs of pictures of masculine and feminine alterna- 
tives (e.g., Indian Chief and Indian Princess, Cos- 
metic Articles and Shaving Articles, etc.) to which 
the child responds by having It choose the one of 
each pair that It would rather be, have or wear. 
Each choice of a male item is scored eight points, 
each choice of a female item is scored zero. 

Four Child-Figures Section, made up of pictures 
of four children: a girl, a girlish boy (boy dressed 
as a girl), a boyish girl (girl dressed as a boy) and 
a boy, to which the child responds by having It 
choose the one It would rather be. Choice of the 
boy is scored twelve points, of the girlish boy eight 
points, of the boyish girl four points, and of the girl 
zero. 

Parental Role Section,⁵ which involves asking the 
child whether It would rather be a mother or a 
daddy when It grows up. 


Table 1 

Group Scores, Variability and Differences by Grade and Sex in Masculinity-Femininity Preference 

Grade and Sex      N     Mdn      M       SD      CR*     F* 
Kindergarten 
  Boys            44    72.50    66.18   19.29 
  Girls           46    41.16    42.50   27.93    4.65    2.10 
First Grade 
  Boys            55    77.00    66.04   25.39 
  Girls           73    72.00    52.07   33.72    2.66    1.76 
Second Grade 
  Boys            52    81.16    17.58   747 
  Girls           60    80.21    57.28   5.12     3.93    4.18 
Third Grade 
  Boys            56    81.46    77.93   18.70 
  Girls           58    79.97    50.02   32.92    3.75    3.10 
Fourth Grade 
  Boys            51    81.23    75.98   20.15 
  Girls           40    71.16    56.40   31.73    [illegible]  [illegible] 
Fifth Grade 
  Boys            45    80.87    76.73   17.05 
  Girls           33    12.00    22.15   27.92    9.82    2.68 

* Values are significant at or beyond the .05 level. 


Masculinity-Femininity Development in Children 199 


Results 


Tables 1 and 2 contain all of the data used 
in the statistical analysis and results reported 
in the present paper. 


Group Sex-Role Patterns as a Whole 


Significant mean differences occur at each 
age level showing that boys score more mas- 
culine and girls more feminine (Table 1). 
This would be expected, since the It scale is 
composed of items associated with one sex in 
contrast to the other sex. Since the scoring of 
the It scale is such that masculine choices are 
credited with points while feminine choices 
are not, a score of 84 represents an exclu- 
sively masculine preference throughout, while 
a score of zero represents an exclusively 
feminine preference. The lower mean score of 
girls compared to boys in each age group in- 
dicates that boys score more masculine than 
girls and, conversely, girls score more femi- 
nine than boys. Even so, it may be noted in 
Table 1 that the median difference between 
boys and girls in the first through third grades 
is quite small, indicating that many girls in 
these grades score very masculine. This fact is 
evident in various other types of analysis dis- 
cussed below. Also indicated is the fact that 
at each age level girls compared to boys are 
significantly more variable in their sex-role 
preference. 
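
The maximum of 84 follows from the section weights given under Procedure: eight toy choices at one point each, eight paired-picture choices at eight points each, and twelve points for choosing the boy figure. The sketch below is only an illustration of that arithmetic; the function name `it_scale_score` and its argument names are hypothetical and do not come from the scale's manual, and the Parental Role Section is left out because it is not scored in this study.

```python
# Illustrative sketch of the scoring rules described under Procedure.
# All names here are hypothetical; only the point values come from the text.

TOY_POINTS = 1        # each masculine toy chosen (8 toy choices are made)
PAIRED_POINTS = 8     # each masculine alternative chosen (8 pairs)
CHILD_FIGURE_POINTS = {
    "boy": 12,
    "girlish boy": 8,
    "boyish girl": 4,
    "girl": 0,
}


def it_scale_score(masculine_toys, masculine_pairs, child_figure):
    """Return the total score for one child's choices made for It."""
    if not 0 <= masculine_toys <= 8:
        raise ValueError("masculine_toys must be between 0 and 8")
    if not 0 <= masculine_pairs <= 8:
        raise ValueError("masculine_pairs must be between 0 and 8")
    if child_figure not in CHILD_FIGURE_POINTS:
        raise ValueError("unknown child figure: %r" % (child_figure,))
    return (
        masculine_toys * TOY_POINTS
        + masculine_pairs * PAIRED_POINTS
        + CHILD_FIGURE_POINTS[child_figure]
    )


if __name__ == "__main__":
    print(it_scale_score(8, 8, "boy"))    # exclusively masculine choices
    print(it_scale_score(0, 0, "girl"))   # exclusively feminine choices
```

Because the paired pictures carry 64 of the 84 possible points, a child's total is dominated by that section.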

A comparison between test data of the kin- 
dergarten sample of the present study with 
the kindergarten sample in the original in- 
vestigation of 1953 (2) shows that on the 
whole, there is very substantial agreement in 
sex-role preference patterns of these two sam- 
ples that are separated temporally by two 
years, that are different geographically, and 
that are different in terms of percentage of 
subjects from government and military fami- 
lies. 


5 This section was added to the It scale after the 
original investigation (2) had been completed and is 
not included in the scoring reported in the present 
study. 


Group Sex-Role Patterns by Age and Sex 


Boys show a predominantly masculine role 
preference at the kindergarten and first-grade 
levels (ages 5½ to 7½) and an even stronger 
masculine preference at the second-grade 
through the fifth-grade levels (ages 7 to 11½). 
When boys in all of the grades are combined, 
about 63% respond with exclusive or near- 
exclusive masculine preference while only 
about 4% respond with an exclusive or near- 
exclusive feminine preference. 

In sharp contrast to the strong masculine 
role preference of boys, girls as a group do 
not show nearly the same degree of feminine 
role preference. At the kindergarten level, 
girls show what may be described as a 
“mixed” role pattern, i.e., one that is charac- 
terized by relatively equal preference of both 
masculine and feminine elements. Beginning 
at the first grade and extending through the 
fourth grade (ages from about 6½ to 10½) a 
much stronger preference is expressed for as- 
pects of the masculine role than for the femi- 
nine role. When girls in all grades are com- 
bined, 40% score at or near an exclusively 
masculine score and about 17% score at or 
near an exclusively feminine score. 

A marked change in sex-role preference pat- 
terns occurs in girls in the fifth grade (age 
range from about 9 years 10 months to 11 
years 6 months with a median age of 10 years 
11 months). In contrast to girls in all earlier 
age groups, the fifth-grade girls appear to 
show much less preference for the masculine 
role and express instead a stronger and in- 
creased preference for things that are femi- 
nine. This apparent change in role preference 
in girls at this age level appears so marked 
that evidence from other studies of similar 
age groups should be established before the 
present finding is accepted. In addition, the 
sex-role preference of Ss in the sixth- and 
seventh-grade levels should be investigated. 
An interesting problem here is whether girls 
as a group become more feminine and less 
masculine in terms of preference just prior to 
and during the pubescent-adolescent period. 

Despite this apparent shift toward greater 
acceptance of the feminine role in fifth-grade 
girls, it is still necessary to recognize that 
even in this age group a larger percentage of 
girls compared to boys express a preference 
for the role of the opposite sex. 


Table 2 

Percentages of Masculine and Feminine Responses of Boys and Girls to 
Various Sections of the It Scale for Children 

                 Total %        Total % eight 
                 toys           paired pictures   Child figures        Parental role 
Grade and Sex    Masc.  Fem.    Masc.  Fem.       Boy   Mix.   Girl    Daddy  Mother 
Kindergarten 
  Boys            75     25      79     21         71    18     11      77     23 
  Girls           58     42      50     50         37    30     33      52     48 
First Grade 
  Boys            78     22      78     22         78     6     16      82     18 
  Girls           66     34      61     39         59    11     30      64     36 
Second Grade 
  Boys            93      7      93      7         90     2      8      94      6 
  Girls           70     30      68     32         67    10     23      73     27 
Third Grade 
  Boys            91      9      93      7         89     7      4      95      5 
  Girls           70     30      70     30         72     2     26      72     28 
Fourth Grade 
  Boys            90     10      91      9         88     4      8      94      6 
  Girls           65     35      68     32         60    12     28      77     23 
Fifth Grade 
  Boys            92      8      92      8         82    11      7      93      7 
  Girls           37     63      27     73         12    12     76      21     79 


Sectional and Item Analysis of the It Scale 
for Children 


Table 2 shows an analysis by sections of 
the It scale. 


Toy Pictures Section. From 75% to 93% of the 
toy item choices of boys in all age groups are for 
masculine objects. In contrast, only 30% to 42% of 
the toy item choices of girls, kindergarten through 
the fourth grade, are for feminine choices. In the 
case of fifth-grade girls, however, 63% of their 
choices are for feminine objects. 

Eight Paired Pictures Section. When this section 
is taken as a whole, from 78% to 93% of all the 
choices of boys in all groups are for masculine al- 
ternatives. For example, on the Indian item, from 
84% to 98% of the boys indicate that It would 
rather be the male Indian than the female Indian. 
In the case of girls from kindergarten through the 
fourth grade, only 30% to 50% of the choices are 


for feminine alternatives. And on the Indian item, 
for example, only 25% to 35% indicate that It would 
rather be the female Indian than the male Indian. 
Fifth-grade girls, however, show a very different 
preference pattern in that 73% of their choices are 
for feminine alternatives, and in connection with the 
Indian item, 76% express a preference for It want- 
ing to be the female Indian rather than the male 
Indian. 

Four Child-Figures Section. When given an oppor- 
tunity to express a preference for the “kind” of child 
It would rather be, from 71% to 90% of the boys 
in all age groups indicate that It would rather be a 
boy. Only 4% to 16% indicate that It would rather 
be a girl and from 2% to 18% indicate It would 
rather be a girlish boy or a boyish girl. On the other 
hand, only 23% to 33% of the girls from kinder- 
garten through the fourth grade indicate that It 
would rather be a girl, while 37% to 72% indicate 
a preference for It being a boy, and from 2% to 
30% indicate It would rather be a girlish boy or a 
boyish girl. These percentages are in sharp contrast 
to fifth-grade girls, 76% of whom express a pref- 
erence for It being a girl, only 12% for It being a 
boy, and 12% for It being a boyish girl. 

Parental Role Section. When asked whether It 
would rather be a mother or a daddy when It grows 
up, 77% to 95% of the boys in all age groups indi- 
cate a preference for It being a daddy, while only 
5% to 23% indicate a preference for It being a 
mother. In the case of girls from kindergarten 
through the fourth grade only 23% to 48% express 
a preference for It becoming a mother, while 52% 
to 77% express a preference for It becoming a daddy. 
A very different parental role preference pattern is 
evident in fifth-grade girls, however, in that 79% in- 
dicate It would rather be a mother and only 21% 
that It would rather be a daddy. 


Theoretical Implications 


The present finding that girls as a group do 
not show nearly the same degree of preference 
for the feminine role that boys show for the 
masculine role is consistent with the results of 
a number of studies of adults in which men 
and women were asked: “Have you sometimes 
wished you were of the opposite sex?” or “If 
you could be born over again, would you 
rather be a man or a woman?” or “Have you 
ever wished that you belonged to the opposite 
sex?” Three such investigations may be cited 
in this connection: Terman’s study (10) of 
792 married couples, the Fortune Survey (3) 
of 1946, and the Gallup Poll (4) of 1955. 
These studies reveal that only between 2½% 
and 4% of adult males compared to between 
20% and 31% of adult females in our culture 
state that they have been aware of the desire 
to be of the opposite sex. This sex difference 
in adults, showing five to twelve times as 
many women as men having been conscious 
of the desire to be of the opposite sex, is 
paralleled by results of the present study. Re- 
sponses to the Parental Role Section of the It 
scale may be taken as an example of this fact 
(Table 2). At the kindergarten level more 
than twice as many girls as boys project a 
preference for the parental role of the oppo- 
site sex (52% compared to 23%) and from 
the first through the fifth grade, between three 
and twelve times as many girls as boys show 
a preference for the role of the parent of the 
opposite sex. 
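
The girl-to-boy comparison above can be recomputed from the Parental Role columns of Table 2. The sketch below is a minimal illustration, assuming the percentages are read correctly from that table; the dictionary and function names are hypothetical.

```python
# Percentages choosing the opposite-sex parental role, as read from the
# Parental Role columns of Table 2: girls choosing "Daddy", boys choosing
# "Mother". Keys are grade labels ("K" = kindergarten).
girls_daddy = {"K": 52, "1": 64, "2": 73, "3": 72, "4": 77, "5": 21}
boys_mother = {"K": 23, "1": 18, "2": 6, "3": 5, "4": 6, "5": 7}


def opposite_role_ratio(grade):
    """Girls' opposite-role percentage divided by the boys' percentage."""
    return girls_daddy[grade] / boys_mother[grade]


ratios = {grade: round(opposite_role_ratio(grade), 1) for grade in girls_daddy}

if __name__ == "__main__":
    for grade, ratio in ratios.items():
        print(grade, ratio)
```

With the table values as transcribed here, the kindergarten ratio is a little over two and the first- and fifth-grade ratios are between three and four, while the second- through fourth-grade ratios fall between about twelve and fourteen.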

Another problem in relation to which the 
present findings are relevant is that of homo- 
sexuality. Although based almost entirely on 
clinical studies of adults, rather convincing 
evidence has been established that points to 
a functional relationship between sex-role in- 


version and certain forms of homosexuality 
(5, 6, 7, 8, 9). In fact, adult, passive, femi- 
nine male homosexuals and active, masculine 
female homosexuals almost invariably have 
childhoods in which sex-role inversion is a 
prominent feature and in which the usual 
parent-child relationship is such that for one 
reason or another the child is unable to form 
a close identification linkage to the parent of 
the same sex together with having formed an 
excessively strong attachment to the parent of 
the opposite sex. This points to the assump- 
tion that the inverted personality comes from 
a family constellation in which the bond be- 
tween mother and son or father and daughter 
was abnormally strong, while that between 
father and son or mother and daughter was 
ineffective, weak or nonexistent. In such cases 
the mother is the identification ideal of the 
son, while the father serves as such an ideal 
in the case of the daughter. 

As far as the present study is concerned the 
occurrence of opposite sex-role preference is 
much more common in girls than in boys. 
This does not mean, however, that all girls 
who show a predominant preference for the 
masculine role are necessarily developing in- 
verted personalities. The essential basis of 
inversion (i.e., the process in which an in- 
dividual adopts the psychological identity 
typical of the opposite sex) appears to be 
an early, continuing, emotionally deep-rooted 
identification with, as well as preference for, 
the sex-role of the opposite sex. Expressed 
preference per se for the role of the opposite 
sex may or may not be based on identifica- 
tion with that role (2). Thus, for example, 
if a girl’s basic and underlying identification 
is with the feminine role, the fact that she 
may show a preference for the masculine role 
during childhood does not necessarily indi- 
cate sex-role inversion. It is quite likely that 
many girls prefer much that is associated with 
the masculine role without having formed a 
fundamental identification with that role. 
Furthermore, in our culture, girls are allowed 
and often encouraged to participate in tasks 
and activities that are typical of boys. Girls 
may wear shirts and trousers for example 
even though such clothing is typically identi- 
fied with the male. The converse is not true 
in the case of boys. Thus, severe social cen- 
sure would result if boys wore dresses or in 
other ways impersonated girls. There are 
many other areas in which girls, in contrast 
to boys, are permitted to take part in activi- 
ties characteristic of the opposite sex. Never- 
theless, if a girl has made a basic and primary 
identification with and continues to prefer the 
masculine role through childhood and into 
adolescence (and conversely with the boy) 
sex-role inversion in adulthood would be ex- 
pected, one aspect of which would be a homo- 
sexual object choice. This whole problem has 
been discussed in greater detail in another 
paper (12). 


Summary 


A masculinity-femininity test, the It Scale 
for Children, was administered to 303 boys 
and 310 girls between the ages of approxi- 
mately 5½ and 11½. These Ss were enrolled in 
classes from kindergarten through the fifth 
grade in the Pleasanton, California, Elemen- 
tary School. The It scale is made up of pic- 
tures of various objects and figures typical of 
and associated with the role of one sex in 
contrast to the role of the other sex. A child- 
figure drawing, referred to as “It,” is used by 
having each S make choices for It. The find- 
ings are as follows: 

1. In each age group the mean score of 
boys is significantly more masculine than the 
mean score of girls and, conversely, the mean 
score of girls is significantly more feminine 
than that of boys. This would be expected 
since the It scale is composed of masculine 
and feminine alternatives, culturally defined. 

2. Girls in all age groups are significantly 
more variable than boys in their sex-role pref- 
erence. 

3. Boys show a much stronger preference 
for the masculine role than girls show for the 


feminine role, particularly from kindergarten 
through the fourth grade. 

4. Girls at the kindergarten level show a 
preference pattern characterized by relatively 
equal preference for masculine and feminine 
elements. 

5. Girls from the first grade through the 
fourth grade show a stronger preference for 
the masculine role than for the feminine role. 

6. In contrast to girls in all earlier grade 
levels, girls in the fifth grade show a pre- 
dominant preference for the feminine role. 

The implications of these findings are re- 
lated to adult studies of opposite sex-role pref- 
erence, to inversion, and to homosexuality. 


Some Correlates of Affective Tone of Early Memories 


June Elizabeth Chance 


University of North Carolina 


The present investigation is an attempt to 
test the idea that affective tone of the earliest 
memory which an individual can report will 
reflect a general trend in his personality or- 
ganization. More specifically, it is hypothe- 
sized that the pleasantness or unpleasantness 
of this memory is predictably related to the 
tendency to recall a proportionately greater 
number of successes or failures in an experi- 
mental task, and that the character of the 
first memory (a form of self-report) is re- 
lated to responses given to a self-report per- 
sonality instrument. 


Procedure 


The subjects of this study were 78 under- 
graduate students enrolled in a psychology 
course. For a majority, it was their first course 
in psychology. The experiment was conducted 
during a regular class meeting and took about 
40 minutes. The Ss were first given a sheet of 
paper on which appeared 32 anagrams. The 
scrambled words varied in length from 4 to 6 
letters; all were very common English words. 
The Ss were told: 


We are trying to construct a brief test of intelli- 
gence which will be especially appropriate for col- 
lege students like yourselves. These are some items 
which we believe might be useful for this purpose, 
but we need to know more about them. Please work 
on them carefully and do the best that you can. 
What you are to do is to rearrange each of the sets 
of letters so that they make a word. Do not use for- 
eign words or proper names; each set of letters will 
make a common English word if properly arranged. 
You will have fifteen minutes in which to work. Go 
ahead. 


At the end of 15 minutes the papers were 
collected. While the Ss had been working, the 
experimenter transcribed the list of anagrams 
to the blackboard. Now the correct solution 
for each anagram was written next to it. 
Solutions were left on the board for a few 
minutes and then erased. 

Blank sheets of paper were then passed out. 
The Ss were asked to think carefully for a 
few minutes, and then to write as complete a 
description as possible of their first childhood 
memory. After they finished their free de- 
scriptions, they were requested to answer the 
following questions which were written on the 
blackboard. (a) Is this memory primarily a 
pleasant or unpleasant one for you? (b) Ap- 
proximately how old do you think you were 
at the time this event occurred? (c) Were you 
talking well, i.e., able to make most of your 
needs known verbally, at the time of this 
event? (d) Is there some possibility that what 
you have described could be a dream or some- 
thing which someone else told you about 
sometime after it happened rather than a 
memory? 

When the questions had been answered, 
papers were collected and another blank sheet 
was passed out. The Ss were then asked to 
write as many words from the list of ana- 
grams (solutions) as they could recall. These 
papers were collected at the end of 10 min- 
utes. The group was thanked for their co- 
operation and dismissed. 

A few weeks earlier an abbreviated group 
form of the MMPI had been administered to 
this group by another experimenter.¹ This 
form included all of the items of the Welsh A 
and R scales. Welsh (2), on the basis of his 
work in developing the scales, has suggested 
that the A scale measures general maladjust- 
ment with anxiety and dysphoria as the most 
prominent features, while R measures a tend- 
ency toward denial and repression. 


1 The author wishes to thank Mr. Carl Cochrane 
for making these data available. 


Hypotheses 


The hypotheses to be tested in this study 
were the following: (a) Ss who report a 
pleasantly-toned early memory tend to recall 
relatively more successes than failures from 
the anagrams task as contrasted with Ss who 
report an unpleasantly-toned memory. (b) Ss 
who report a pleasantly-toned early memory 
tend to have R scores higher than A scores, 
while Ss who report an unpleasantly-toned 
memory tend to have A scores higher than R 
scores. 


Treatment of Data and Results 


Since different Ss successfully completed 
different numbers of anagrams, an analysis 
based on the absolute numbers of successes 
and failures recalled would be misleading. 
Consequently, a score for the number of suc- 
cesses was computed by dividing the number 
of successes recalled by the number of suc- 
cesses obtained on the original task. A simi- 
lar operation was performed to obtain a 
score for recall of failures. These scores are 
designated PRS and PRF, respectively. 

Since PRS and PRF are proportions, the 
scores of an S who had either solved most of 
the anagrams or very few of them might be 
relatively unreliable. For this reason, data 
from Ss who had solved fewer than 6 or more 
than 26 anagrams were eliminated from the 
study. Also data of Ss who expressed doubt as 
to the validity of the memory they gave were 
eliminated. Thirteen Ss of the original group 
were not included in the analysis for these 
two reasons. The data of the remaining 78 
are presented here. 
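
The proportion scores and the exclusion rule on number solved can be sketched as follows. This is an illustration only: the function name `recall_scores` is hypothetical, the figures in the usage lines are invented rather than taken from the data, and the sketch assumes that every one of the 32 anagrams an S did not solve counts as a failure.

```python
TOTAL_ANAGRAMS = 32  # length of the anagram list given to each S


def recall_scores(solved, successes_recalled, failures_recalled):
    """Return (PRS, PRF) for one S, or None if the S is excluded.

    PRS = successes recalled / successes obtained on the original task.
    PRF = failures recalled / failures on the original task.
    Assumption: an unsolved anagram is treated as a failure.
    """
    # Ss who solved fewer than 6 or more than 26 anagrams are dropped,
    # since a proportion based on very few items is unreliable.
    if solved < 6 or solved > 26:
        return None
    failed = TOTAL_ANAGRAMS - solved
    prs = successes_recalled / solved
    prf = failures_recalled / failed
    return prs, prf


if __name__ == "__main__":
    print(recall_scores(16, 8, 4))   # hypothetical S inside the limits
    print(recall_scores(5, 3, 3))    # hypothetical S who solved too few
```

Within the retained range both denominators are at least 6, so neither proportion rests on only a handful of items.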

Pleasantly-toned memories were reported 
by 38 Ss; unpleasantly-toned memories, by 
40. These proportions correspond fairly well 
to those previously reported in the literature 
(1). Ages reported by Ss at which they be- 
lieved the recalled event to have occurred 
varied from 1 to 6 years, with a median of 3 
years. Only 4 Ss believed that they were not 
talking to some degree at the time the re- 
called event took place. Memories ranged 
from complex social interactions such as be- 
ing ostracized or approved by other children, 
through painful accidents and being fright- 
ened, to simple but vivid sensory impressions 
like colors and foods. The S’s categorization 


of the memory as pleasant or unpleasant was 
accepted in all cases, regardless of content. 

In order to test the first hypothesis, PRF 
was subtracted from PRS for each S. (Only 6 
of 78 Ss recalled more failures than successes; 
4 of these 6 were in the unpleasant memory 
group.) For the resulting values a test of the 
mean difference between the pleasant and un- 
pleasant memory groups was computed. The 
difference was in the predicted direction; the 
value of t obtained was 2.35 (76 df, .05 > 
p > .02). 

To test the second hypothesis, three tests 
were made. First, the mean difference between 
the pleasant and unpleasant memory groups 
in A scores (converted to standard scores on 
the basis of a table presented by Welsh) was 
computed. The value of t obtained was 1.69 
(76 df, .10 > p > .05). While the unpleasant 
memory group had A scores slightly higher 
than those of the pleasant memory group, the 
result is not statistically significant. 

Next, the mean difference between the pleas- 
ant and unpleasant memory groups in R 
scores (standard scores) was computed. The 
value of t equalled 2.08 (df 76, .05 > p > 
.02). This difference was significant and in 
the direction predicted, i.e., the pleasant 
memory group had R scores higher than those 
of the unpleasant memory group. 

Third, a chi-square test of the relationship 
between individuals’ A and R scores was com- 
puted (Table 1). Cases were tabulated within 
each memory group depending upon whether 
the A score exceeded the R score or vice 
versa. The value obtained was 8.62 (1 df, .01 
> p > .001) which is highly significant. Thus 
individuals in the pleasant memory group 
were more likely to have R scores higher than 
their A scores, while those in the unpleasant 
memory group were more likely to have A 
scores higher than their R scores. 


Table 1 

Chi-Square Test of the Relationship Between Affective 
Tone of Early Memory and Discrepancy 
Between A and R Scores 

Group                       A>R    R>A    Total 
Pleasant memory group        14     24      38 
Unpleasant memory group      28     12      40 
Total                        42     36      78 

χ² = 8.62; df = 1; .01 > p > .001. 
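
The chi-square value can be reproduced from the four cell counts of Table 1. The sketch below uses the standard shortcut formula for a 2 x 2 table without a continuity correction; that the uncorrected statistic was the one used is an assumption, though it is the form that matches the reported value. The function name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]].

    Uses the shortcut N(ad - bc)^2 / (row and column totals multiplied),
    with no continuity correction.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Cell counts from Table 1:
# pleasant group:   A>R = 14, R>A = 24
# unpleasant group: A>R = 28, R>A = 12
chi2 = chi_square_2x2(14, 24, 28, 12)

if __name__ == "__main__":
    print(round(chi2, 2))  # 8.62
```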

Tests of the mean difference in numbers of 
anagrams originally solved and total number 
of solutions recalled by the two memory 
groups were not significant. 


Discussion 


The study should be considered as a prob- 
lem in selective retention rather than repres- 
sion in the psychoanalytic sense. The con- 
firmation of the first of the hypotheses sug- 
gests that the affective character of what the 
individual reports as his first memory does 
bear a consistent relationship to his tendency 
to recall other things which have some affec- 
tive loading attached. The findings regarding 
the second hypothesis indicate some consist- 
ency in self-report, particularly with respect 
to denial or tendency to ignore that which 
might be discomforting to the person. In a 
peripheral way, the test of the second hy- 
pothesis might be considered to be a valida- 
tion of the R scale. 


Summary 


Seventy-eight college students were asked to 
solve a list of anagrams presented as part of 


an intelligence test in the process of construc- 
tion. They were allowed sufficient time to 
complete only a part of the list. They were 
then given the solutions to all of the ana- 
grams. Next, they were asked to describe the 
earliest childhood memory they could recall 
and to state whether this memory was pleas- 
ant or unpleasant. At the end of the experi- 
mental session they were asked to recall as 
many words from the original anagrams list 
as they could. A hypothesis regarding affective 
tone of the memory and relative tendency to 
recall successes or failures was confirmed. A 
second hypothesis of the relationship between 
character of the memory and scores on the 
Welsh A and R scales (MMPI) was also con- 
firmed. 


The purpose of this study was to determine 
whether children varied as their parents did 
in psychological control. The construct, psy- 
chological control, refers to the manner in 
which the individual regulates inner and outer 
tensions or stress. Running on a continuum 
from constriction and inhibition at one end to 
impulsiveness and expressiveness at the other, 
psychological control as a central personality 
variable tells us how the individual manages 
his feelings, how he relates to others and to 
himself, his attitudes toward authority and 
toward ambiguity. 

After preliminary exploration of experimen- 
tal and clinical methods for measuring psy- 
chological control, an 82-item questionnaire 
was developed. The items cover a variety of 
everyday situations and deal with personal 
and interpersonal attitudes judged to be be- 
havioral manifestations of psychological con- 
trol. The independent judgment of four clini- 
cal psychologists, when in unanimous agree- 
ment, was the basis for the original selection 
of the items. 


1An extended report of this study may be ob- 
tained without charge from Joseph Luft, San Fran- 
cisco State College, San Francisco 27, California, or 
for a fee from the American Documentation Insti- 
tute. Order Document No. 5105, remitting $2.00 for 
microfilm or $3.75 for photocopies. 

2 This investigation was supported (in part) by a 
research grant (M-528) from the National Institute 
of Mental Health of the National Institutes of 
Health, Public Health Service. The study was car- 
ried out at Stanford University. Jeanne Block, Bar- 
clay Martin and Eva Shippee collaborated in this 
research. 





Supporting the validity of the instrument, 
significant correlations were found between 
scores on the psychological control question- 
naire and on independent measures such as 
the Berkeley F scale, expansiveness on the 
draw-a-person test, favorable attitude toward 
self as measured on an adjective checklist and 
teachers’ ratings of cooperativeness. Reliabil- 
ity measures varied from .72 to .91. 

The questionnaires were administered to 79 
boys (mean age 14.5) and to their parents, 
and to 25 girls (mean age 15.0) and to their 
parents. 

Results suggested that there was a signifi- 
cant relationship between parents and sons in 
psychological control, somewhat higher be- 
tween sons and mothers than between sons 
and fathers. Fathers in general had higher 
scores than mothers. When parents varied 
sharply with each other on psychological con- 
trol scores, the sons’ scores were not signifi- 
cantly different from mean scores for boys in 
general. 

Results for girls and their parents were 
quite erratic, and in general were not pre- 
dicted. A high negative correlation (-.60) 
at the .001 level between fathers and daugh- 
ters was noted, while mothers and daughters 
were not significantly related on psychologi- 
cal control scores. 

These results point to a need for more di- 
rect and accurate measurement of ego func- 
tioning within the family setting. 


Brief Report. 
Received January 7, 1957. 


A Typology of Tests, Projective and Otherwise 


To judge from the pages of this and similar 
journals, the projective test movement is here 
to stay. But the rubric projective, once useful 
in mobilizing a reaction against older diag- 
nostic procedures, has been stretched to in- 
clude such a heterogeneous variety of meas- 
ures that its denotational value has become 
attenuated. In a paper on indirect attitude 
measurement (5), the present writer proposed 
a typology of test formats which has turned 
out to be one of the most cited features of 
that paper. Since that typology is now deemed 
to be inadequate, and since the denotational 
problem still remains, this second effort is of- 
fered. From an initial focus on personality 
tests come three dichotomies, which generate 
eight test types. Five of these types contain 
tests commonly regarded as “projective,” and 
only two are unrelated to personality meas- 
urement. 


The Three Dichotomies 


1. Voluntary vs. objective. In the voluntary 
test the respondent¹ is given to understand 
that any answer is acceptable, and that there 
is no external criterion of correctness against 
which his answers will be evaluated. He is en- 
couraged in idiosyncrasy and self description. 


The test assignment may state “this is not a test 
of your ability,” or “there are no right or wrong an- 
swers” or “answer in terms of how you really feel.” 
In contrast, in an objective test the subject is told, 
either explicitly or implicitly, that there is a correct 
answer external to himself, for which he should 
search in selecting his answer. The concepts of “ac- 
curacy” and “error” are in the subject’s mind. Phe- 
nomenologically he is describing the external, objec- 
tive world, although in so doing he is inevitably 
reflecting his idiosyncratic view of that world, and 
can be unselfconsciously “projecting” in an impor- 
tant meaning of that word, as will be illustrated 
more fully in the discussion of test types 5 and 6 
below.² 


1 The term respondent is borrowed from social psy- 
chology to refer to the person from whom data is 
collected, and whose personality is being examined. 
The term subject, or S, as commonly used in experi- 
mental psychology reports is felt to be connotatively 
inappropriate. More clinically oriented terms, such as 
patient or client, are too specialized to designate the 
full range of respondents employed in personality re- 
search. 


2. Indirect vs. direct. In the direct test, the
respondent’s understanding of the purpose of
the test and the psychologist’s understanding
are in agreement. Were the respondent to read
the psychologist’s report of the test results,
none of the topics introduced would surprise
him. This is obviously so in an achievement
test given at the end of a course. It is equally
true for the typical public opinion poll. It is
probably so for the usual diagnostic interview.
It is so for many interest tests and adjustment
inventories.


2 The distinction here made partially overlaps the
discussion by Rosenzweig (19) on levels of response
to projective tests. His category subjective clearly
belongs with the voluntary class as here defined,
emphasizing the subject’s self-conscious focus on
describing himself. In his projective category, the
respondent looks away from himself at some
“ego-neutral” object, as in the phenomenologically
objective orientation of this paper. His category called
objective is not the same as the present usage, but
refers to the psychologist’s orientation, and includes
a behavior-sampling approach not relevant to the
present discussion. The distinction is also related to
Cattell’s discussion of varieties of projective tests
(8). When he classifies certain approaches as a
variety of objective tests employing misperception, his
usage is in agreement with that of the present writer.
This agreement is limited, however, and on many
points the analyses differ, as when Cattell places
the TAT and the Tautophone in the same subtype.
In the present analysis, the TAT is voluntary, and
the Tautophone the most classic example of the
objective assignment among projective personality
measures.
In the indirect test, the psychologist interprets the
responses in terms of dimensions and categories dif- 
ferent from those held in mind by the respondent 
while answering. If a person tells stories to pictures 
under the belief that his thematic creativity is being 
measured, and the psychologist then interprets the 
products as depth projections, the test is indirect. If 
a person expresses his likes and dislikes about a series 
of drawings, and is as a result classified as an oral- 
pessimist, the test is indirect. If a respondent believes 
he is participating in a general survey of the public’s 
opinion on a variety of harmless issues, and gets 
scored, however invalidly, as a paranoid proto-fascist, 
the test is indirect. In general, whenever responses are 
taken as symptoms, rather than as literal information, 
the test is indirect. 

Characteristic of the indirect test is a facade. By
this is meant a false assignment to the respondent 
which distracts him from recognizing the test’s true 
purpose and which provides him with a plausible 
reason for cooperating. Initially the TAT had such 
a facade: “This is a test of your creative imagination.”
The objective test facade is used in an important
class of indirect tests of social attitudes, in which
the respondent tries to show his knowledge of cur- 
rent events and is scored for the bias he shows in the 
directionality of his errors (5, 14). The expression of 
aesthetic taste, participation in public opinion sur- 
veys, judgments of moral right and wrong, and judg- 
ments of logical consistency all have been used as 
facades in indirect tests of personality, interests, or 
social attitudes. 

The potential ethical problems arising from the ap- 
plication of personality and attitude tests in adminis- 
trative situations (e.g., 21) would seem to center 
around this one dimension of indirection. 


3. Free-response vs. structured. This di- 
chotomy is already well established in the 
classification of personality and attitude as- 
sessment procedures. Typically, the projective 
tests have been open-ended, free, unstruc- 
tured, and have had the virtue of allowing 
the respondent to project his own organiza- 
tion upon the material. 


The free-response format has the advantage of not 
suggesting answers or alternatives to the respondent, 
of not limiting the range of alternatives available, 
nor of artificially expanding it through the sugges- 
tions provided in the prepared alternatives. In the 
multiple-choice Rorschach the respondent can see 
images pointed out to him by the prepared alterna- 
tives which he would never have noticed on his own. 
The structured format was typical of the personality 
and attitude measurement devices of the first flower- 
ing of such tests in the period from 1920 to 1935, 
and hence provides the tradition against which both 
the projective test movement and modern survey re- 
search techniques were revolting (e.g., 11). But even 
this dimension is not uniquely associated with pro- 
jective tests, as the two earliest papers in English 


Campbell 


using the concept of projection in a testing setting 
(15, 7) used structured response formats. 


The Eight Test Types 


1. Voluntary, indirect, free-response. These 
are the classic projective techniques, including 
free association, the Rorschach, the Thematic 
Apperception Test, doll play, drawing, and 
such projective questions as, “What do you 
admire most in people?” or “What is the 
most embarrassing thing you can think of?” 
It is in this category that most inventiveness 
has been shown, and the present paper makes 
no effort to cite even a fraction of the appropriate
studies.³

2. Voluntary, indirect, structured. In this 
category would be found the multiple-choice 
Rorschach and multiple-choice association 
tests. In addition, indirect questionnaires 
would fall in this cell, such as the F scale 
measure of authoritarian personality trends 
(1). Where Osgood’s semantic differential 
(16) is used to measure indirectly attitudes 
toward parents, and other important figures, 
it belongs in this category, as would a Q-sort 
approach (10) to unconscious identification, 
for example. Humor tests and annoyance in- 
ventories used for indirect diagnostic purposes 
also belong here. The Barron-Welsh art pref- 
erence test (3) is another good example of 
this category. The Blacky test (4) contains 
both free-response and structured features, 
and in part belongs here. 

3. Voluntary, direct, free-response. This 
category is epitomized by sentence-comple- 
tion tests, essay-type questionnaires, the auto- 
biographical assignment frequently given in 
personality research, and the open-ended in- 
terview in public opinion surveys. Of these 
the sentence completion tests, at least, are 
commonly regarded as projective, but are clas- 
sified here in the belief that rarely is the re- 
spondent unaware that he has been revealing 
his own attitudes. 

3 No effort has been made to provide bibliographical
references for the well-known projective tests.
Such references are available in a number of sources
(e.g., 2). Where, in the effort to supply adequate
illustrations of each category, a less well-known test
is cited, representative bibliographical references are
provided.

4. Voluntary, direct, structured. This category
would include the classic quantitative
efforts to measure adjustment, personality, 
interests, and attitudes, including the Wood- 
worth inventory, the Thurstone and Likert 
attitude tests, the Strong and Kuder interest 
inventories, the Bernreuter, the MMPI, the 
many biographical inventories, and many 
others. When these are scored with an em- 
pirical key, or are presented in a forced- 
choice format, the respondent may be in the 
dark about the psychologist’s interpretation 
of a particular response. But even in these 
instances, the topics and dimensions of inter- 
pretation used by the psychologist are still 
congruent with the purposes of the test as 
understood by the respondent. Since in gen- 
eral most test constructors tend to introduce 
some efforts at disguise, were too strict an in- 
terpretation placed upon this dimension, the 
category of direct tests would be very small 
indeed. 

5. Objective, indirect, free-response. These 
are projective tests using the objective test 
facade, focusing the respondent’s attention on 
the external world but allowing an unstruc- 
tured response situation. The oldest and most 
used of the projective tests in this category is 
the “Verbal Summator” or “Tautophone” (13, 
19). A recording of indistinct vowel sounds 
is presented with some such instructions as 
“This is a recording of a man talking. He is 
not speaking very plainly, but if you listen 
carefully you will be able to tell what he is 
saying. I’ll play it over and over again, so 
that you can get it, but be sure to tell me as 
soon as you have an idea of what he is say- 
ing.” Subjects almost unanimously accept the 
facade and produce intelligible verbal con- 
tent which they are totally unaware comes 
from themselves. It should be noted that there 
are also several auditory apperception or au- 
ditory association methods which should not 
be confused with the Tautophone and which 
belong clearly in category 1. 


Sherriffs’ Intuition Questionnaire (20) is another 
excellent example of this category. His instructions 
read “Give a probable explanation for the behavior 
indicated in each of the following excerpts from life 
histories taken from a random sample of the population.
Include the motivation underlying the behavior
and the origins of the motivation.” This technique
has seen very successful application by French
(12) in the measurement of achievement and affiliation
motives. Day’s (9) task of asking respondents
to explain a sample of day dreams is similar.

Rechtschaffen and Mednick’s Autokinetic Word
Technique (17) belongs here. In the autokinetic
illusion, a single dot of light in an otherwise totally
dark room appears to move. They presented a random
series of exposure periods and told the respondents
that words were being written by the point of
light, which they were to report. All of their
respondents “saw” words, and when told about the
nature of the experiment were shocked to learn that
they had themselves fabricated the content.

The assignment to judge the character of persons
presented in photographs can be presented as an
objective task. Common statements in psychology
as to the impossibility of making valid judgments of
this kind may create a problem, although with a
properly prepared set of materials (e.g., 6) the
phenomenological validity of the task is great enough
so that even college students will accept it as a
legitimate objective task.


6. Objective, indirect, structured. For many
of the test formats of type 5, structured forms
could be prepared. The trait judgments from
the photographs assignment has been used in
both free-response and structured forms (e.g.,
6), and indeed used a structured response in
its first application by Murray (15). The
much used error-choice approach to attitude 
measurement (14) is another typical example 
of this category. 

7. Objective, direct, free-response; and 

8. Objective, direct, structured. The three
dichotomies which have generated the above
six types of personality test also produce these
two remaining categories. These turn out to
be the typical tests of ability or achievement 
in free-response and in structured form, the 
latter category being characteristic of almost 
all standardized tests in these areas. 
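The crossing of the three dichotomies can be sketched in a few lines of Python. This is only an illustration of the classification scheme; the names `DICHOTOMIES`, `test_types`, and `PERSONALITY_TYPES` are hypothetical and do not come from the paper. Listing each dichotomy with its first-named pole first reproduces the numbering used above.

```python
from itertools import product

# The three dichotomies, each ordered as in the paper's numbering
# (types 1-4 are voluntary, types 5-8 are objective, and so on).
DICHOTOMIES = [
    ("Voluntary", "Objective"),
    ("Indirect", "Direct"),
    ("Free-response", "Structured"),
]


def test_types():
    """Return {type number: (pole, pole, pole)} for the 2 x 2 x 2 crossing."""
    return dict(enumerate(product(*DICHOTOMIES), start=1))


TYPES = test_types()

# The objective, direct cells (types 7 and 8) are the ability and
# achievement tests; the remaining six are treated as personality tests.
PERSONALITY_TYPES = [
    number
    for number, (assignment, directness, _format) in TYPES.items()
    if not (assignment == "Objective" and directness == "Direct")
]

if __name__ == "__main__":
    for number, poles in TYPES.items():
        print(number, ", ".join(poles))
```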


Summary 


From a consideration of differences among 
personality measurement approaches, three 
dichotomies or dimensions of distinction have 
been drawn. The joint application of these 
generates eight test types. Of these, six are 
appropriate to the field of personality meas- 
urement. These are: 


1. Voluntary, Indirect, Free-Response.
2. Voluntary, Indirect, Structured.
3. Voluntary, Direct, Free-Response.
4. Voluntary, Direct, Structured.
5. Objective, Indirect, Free-Response.
6. Objective, Indirect, Structured.

Examples are provided for all six types. While
category 1 contains the most typical projec- 
tive tests, tests called projective are found in 
all except category 4. Categories 5 and 6 are 
the least developed but should be given par- 
ticular attention, as they can involve the un- 
selfconscious projection of personality content 
upon the phenomenologically objective envi- 
ronment. 


Received September 20, 1956. 
The Rorschach test was introduced 35 years
ago, but although a considerable amount of 
research has been published on it, the situa- 
tion is still such that its reliability is undeter- 
mined and its validity suspect. The major sup- 
port for the test comes from its extensive use 
in the clinic and from studies which indicate 
that, used globally, it has some value. What it 
is in the test that works, to what extent its 
value is a function of the test per se and of 
the observations and interview material as- 
sociated with it, and why particular scores, in 
isolation or in patterns, measure personality 
in the way they do, if they do, still remains 
to be determined. 

It is sometimes argued that the Rorschach 
test has been insufficiently experimentally 
validated because there are no adequate cri- 
teria of personality against which to validate 
it. However, a first step is indicated. Person- 
ality, whatever else it is, involves the concept 
of stable individual differences through time. 
If everyone were the same as everyone else, 
there would be no need for the word, and if 
everyone varied haphazardly from day to day, 
a table of random numbers would be as good 
a measure as any. Consequently, the most 
fundamental requirement of a test of person- 
ality is that it demonstrate relatively stable 
individual differences in whatever it purports 
to measure. If it cannot do so, there is no 
point in attempting to validate it. In a sense, 


1 This report is based upon a further analysis of 
data obtained by the third author for a master’s 
thesis submitted under the supervision of the senior 
author to the University of Massachusetts. The sec- 
ond author carried out the additional analysis under 
the direction of the senior author. 


this is merely stating that validity cannot be
established without test-retest reliability, but 
in light of Rorschach development, this view 
requires emphasis. 

In the present study an attempt is made to 
determine whether responses to inkblots pro- 
vide an adequate measure of stable individual 
differences. In order to obtain several samples
of responses without repeating the same
test, it was necessary to construct new sets of
inkblots. Consequently, this study cannot be
considered a Rorschach study except to the
extent that the blots in the Rorschach have
something in common with inkblots in general.


Method 
Subjects 


In that the research plan required repeated
testing of the same subjects, it was necessary
for practical reasons to keep the sample small.
Accordingly, 16 volunteer sophomores, equally
divided among males and females, who were
enrolled in an introductory course in psychology
at the University of Massachusetts, were
used as subjects (Ss).


Materials and Apparatus 


The testing materials consisted of 100 cards
6 by 5 inches with symmetrical inkblot designs
on them. The cards were randomly assigned
to ten sets of ten cards each.


Each card consisted of a large achromatic blot
above or below a large chromatic blot, a small
chromatic blot on both sides of the large achromatic
blot, and a small achromatic blot on both sides of
the large chromatic blot. The large achromatic blot
was internally differentiated by superimposing several
layers of black ink; the large chromatic blot by
the use of different colors. The small blots were of
single colors. In addition to black, each card con- 
tained the colors red, green, brown, and blue, which 
were produced by the application of standard inks. 
An example of a test card is shown in Fig. 1. Within 
sets, the cards were alternately arranged so that half 
the time the large chromatic blot appeared on top 
and half the time on bottom. The cards were de- 
signed in the manner described as the arrangement 
was esthetically pleasing, offered an opportunity for 
several responses per card, provided a relatively equal 
opportunity for responses to colored and noncolored 
areas, and made it possible to produce parallel sets 
composed of any number of cards without undue 
concern about the characteristics of the individual 
cards. 

Location charts, to enable the Ss to indicate to 
what areas they were responding, were made by 
tracing outlines of the blots on mimeograph stencils. 
Standard response sheets provided a place for the 
entry of each response. An opaque projector was 
used to present the blots. 


Procedure 


The Ss were seated within a small desig- 
nated area in order to keep the projected 
stimulus relatively constant. The room was 
dimly illuminated on the sides to provide 
sufficient light for writing. Preceding the first 
session, the following explanation was given: 
“This investigation is concerned with the de- 
termination of what different people see in 
inkblots. It will be necessary to use a great 
many inkblots, and so ten sessions will be re- 
quired. There will be two sessions each week 
for the next five weeks. At every session a
series of inkblots will be presented and each 
of you will be requested to record what you 
see in the blots.” Within the five-week period
the sessions were approximately equally
spaced. The group was instructed to write
three responses for each card, which was pre- 
sented for three minutes, and to identify the 
responses on the location sheet. (An attempt 
to obtain an inquiry by following the Group 
Rorschach procedure [4] was abandoned after 
pretesting indicated that many Ss failed to 
comprehend what was expected of them.) 


Scoring 


For the most part the protocols were scored 
by the Klopfer (5) procedure. Shading was 
not scored as the incidence of responses in 
which it clearly entered was too low for 
evaluation, possibly as a result of the ab- 
sence of an inquiry. In scoring color responses, 
color was presumed to have influenced the 
percept if it was directly described as colored 
(e.g., “a blue butterfly”) or if both the con- 
tent and the area strongly indicated color 
(e.g., “blood” to a red area). Klopfer scores 
that were utilized without revision were: M,
FM, m, F, FC, CF, C, ΣC, H, Hd, A, Ad, S.
Following is a list of the remainder of the 
scores: 


W: Responses to the entire large chromatic or 
achromatic blots. 
w: Responses to the entire small chromatic or 
achromatic blots. 
D: Responses to major differentiated areas of the 
large blots. 
dd: Responses to minute or undifferentiated areas. 
CA: Responses to chromatic areas. 
AA: Responses to achromatic areas.


Form-level: The sum of weights assigned to responses
according to the degree of differentiation,
integration, and elaboration involved. (Initially, an
attempt had been made to consider form accuracy,
but it soon became apparent that without an inquiry
such judgments were too subjective.) A basal score
was assigned by weighting a rejection as 0 (30 responses
were required), vague form as 1.0 (e.g.,
“cloud,” “explosion”), simple form as 1.5 (e.g.,
“drop of water,” “leg”), and complex form as 2.0
(e.g., “human,” “animal”). An additional weight of
.5 was added for each of the following: movement,
appropriate use of color, integration of two or more
separate areas, or any other appropriate elaboration.
A weight of .5 was subtracted for any inappropriate
elaboration. The maximum weight allowed for any
single response was 3.5.
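The form-level weighting can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure: the function and argument names are hypothetical, the increment per appropriate elaboration is taken as .5 (in line with the .5 subtracted for an inappropriate elaboration and the 3.5 ceiling), and no lower bound is applied because the text does not state one.

```python
# Basal weights by form category, as listed in the text.
BASAL = {"rejection": 0.0, "vague": 1.0, "simple": 1.5, "complex": 2.0}
STEP = 0.5        # assumed weight per elaboration (added or subtracted)
MAX_WEIGHT = 3.5  # ceiling for any single response


def response_weight(form, appropriate=0, inappropriate=0):
    """Weight for one response.

    form          -- one of the keys of BASAL
    appropriate   -- count of appropriate elaborations (movement, color,
                     integration of areas, other)
    inappropriate -- count of inappropriate elaborations
    """
    weight = BASAL[form] + STEP * appropriate - STEP * inappropriate
    return min(weight, MAX_WEIGHT)


def form_level(responses):
    """Form-level score for a protocol: the sum of the response weights.

    responses -- iterable of (form, appropriate, inappropriate) tuples
    """
    return sum(response_weight(*r) for r in responses)
```

With this sketch, a complex form with movement and appropriate color would receive 2.0 + .5 + .5 = 3.0, and a protocol's form-level is the total over its 30 required responses.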
CF + C: The arithmetical sum of CF and C responses.
These scores were combined because the frequency
of either, alone, was too low for satisfactory
evaluation.

M + FM + m: This sum was investigated in order
to determine whether the combined activity responses
were a more satisfactory measure of individual
differences than the separate components. There is as
much intrinsic logic to a combined activity score as
there is to a combined color score.

M:(ΣC + M): This ratio was computed in place
of the Rorschach M:ΣC ratio as it is statistically
more stable.

(M + FM + m):(ΣC + M + FM + m): This ratio
was investigated in order to determine whether a
combined activity score in relation to a combined
color score was a more effective measure of individual
differences than the preceding ratio.

Anx: The Elizur (3) content scale of anxiety. 

Hst: The Elizur content scale of hostility. 


Statistical Analysis 


Each of the scores was treated by a double 
classification analysis of variance which per- 
mitted an evaluation of differences between in- 
dividuals and between testing sessions. Those 
scores whose distributions were markedly 
skewed were transformed by adding .5 to each 
score and extracting the square root of the 
sum. Reliability coefficients were computed by 
using the ratio of the variance between Ss 
minus the error variance, to the variance be- 
tween Ss plus (k — 1) error variance, where 
k is the number of sessions (6). These coeffi- 
cients reflect the degree to which a score 
measures variance due to individual differ- 
ences relative to error variance. The signifi- 
cance of the reliability coefficients was deter- 
mined by the F ratio of the variance between 
Ss to the error variance. 
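The computation described above can be sketched in plain Python. This is an illustrative reading of the text, not the authors' worksheet: "variance between Ss" and "error variance" are taken to be the mean squares from a subjects-by-sessions analysis of variance with one score per cell, and the function names are hypothetical.

```python
from math import sqrt


def transform(score):
    """Square-root transformation applied to markedly skewed scores."""
    return sqrt(score + 0.5)


def reliability(scores):
    """Reliability coefficient and F ratio from a subjects x sessions table.

    scores -- list of rows, one row per subject, one column per session.
    Returns (r, F), where
        r = (MS_subjects - MS_error) / (MS_subjects + (k - 1) * MS_error)
        F = MS_subjects / MS_error
    and k is the number of sessions.
    """
    n = len(scores)        # number of subjects
    k = len(scores[0])     # number of sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    session_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_sessions = n * sum((m - grand) ** 2 for m in session_means)
    ss_error = ss_total - ss_subjects - ss_sessions

    ms_subjects = ss_subjects / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    r = (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)
    return r, ms_subjects / ms_error
```

Under this reading the coefficient and the F ratio are tied together by r = (F - 1)/(F + k - 1), so testing F is equivalent to testing whether r exceeds zero.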


Results 
Individual Differences 


Separate analyses of variance were per- 
formed for the data on all ten sessions and 
for the data on the first and second sessions. 
Table 1 presents the reliability coefficients 
derived from these analyses. For the results 
on all ten sessions, every score is seen to 
measure individual differences to a significant 
degree. This includes not only the usual 
Rorschach-type scores, but also the location 
scores of responses to chromatic and achro- 
matic areas, the combination scores, and the 
Elizur content scales of anxiety and hostility. 

Table 1

Reliability Coefficients Based on All Ten Sessions
and on Sessions I and II Only

Score                         I...X   I & II   Mdn

W                             .38     .16      [?]
w                             .33     .04      4.4
D                             [?]     [?]      [?]
dd                            .?1     .06      2.95
CA                            .26     .14      14.95
AA                            .27     .16      15.20
S_t                           .56     .15      1.45
M                             .42     .19      3.45
FM_t                          .27     .21      1.45
m_t                           .23     .82      .62
M + FM + m                    .53     .61      5.75
FC_t                          .20     .23      1.32
(CF + C)_t                    .33     .60      .65
ΣC                            .48     .72      1.3?
M:(ΣC + M)                    .46     .61      .72
(M + FM + m):(ΣC + M + FM + m) .52    .80      .80
F                             .41     .59      2.20
Form-level                    .51     .61      5.73
H_t                           .43     .5?      5.23
A                             .50     .48      9.45
Anx                           .26     .39      7.95
Hst_t                         [?]     .4?      1.70

Note. Reliability coefficients were derived from analyses of
variance. For sessions I...X all coefficients were significant
beyond the .001 level. For sessions I & II, the following were
significant at the .001 level: m, ΣC, (M + FM + m):(ΣC + M + FM + m);
the following at the .01 level: M + FM + m, CF + C, M:(ΣC + M),
F, Form level; the following at the .05 level: S, M, A, Anx.

Transformations are indicated by the subscript t. Median
values represent the average median raw score per session.

The finding that all scores measure stable
individual differences indicates that responses
to inkblots tell something about individuals,
but does not indicate the degree to which
they do. This can be determined by the magnitude
of the reliability coefficients in Table 1,
which are seen to vary from .20 to .56 when
based on all the data. The median coefficient
is .40. The score combining M + FM + m
gives a higher reliability coefficient than any
of its components and also provides a higher
coefficient when related to the weighted color
score in place of M, suggesting that this score
deserves further investigation. None of the
measures offers sufficient reliability for individual
prediction. The question might be
raised as to whether increasing the length of
the test would make the reliability satisfactory.
By the Spearman-Brown formula it is
found that if length is the only limitation,
utilizing 30 cards in place of 10 would raise
a reliability coefficient of .50 to .75, which is,
at best, a minimally acceptable figure for individual
prediction. Moreover, the time involved
in obtaining and scoring 90 responses
would make such a test prohibitive for most
purposes.
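The Spearman-Brown projection cited in the discussion of test length can be checked with a short sketch. The formula is the standard one; the function names are hypothetical and chosen only for illustration.

```python
def spearman_brown(r, factor):
    """Projected reliability when a test is lengthened by `factor`.

    r      -- reliability of the test at its present length
    factor -- new length divided by present length (30 cards / 10 cards = 3)
    """
    return factor * r / (1 + (factor - 1) * r)


def factor_needed(r, target):
    """Lengthening factor required to move reliability from r to target."""
    return target * (1 - r) / (r * (1 - target))


if __name__ == "__main__":
    # Tripling a ten-card test whose reliability is .50.
    print(spearman_brown(0.50, 3))  # 0.75
```

Applied to the median coefficient of .40 reported for all ten sessions, the same tripling would give roughly .67, which falls short of the .75 figure discussed in the text.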

In order to hold the influence of repeated 
testing to a minimum, the findings on only 
the first and second tests were considered. In 
Table 1 it can be seen that the magnitude of 
the reliability coefficients for sessions I and II
ranges from .04 to .82. The median coefficient 
is .48. Fourteen of the 22 coefficients are 
higher for sessions I and II than for sessions 
I through X, which hardly suggests a real 
difference, although what difference there is 
is in favor of the former. The few coefficients 
which are above .70 can best be accounted 
for by the decrease in stability of the coeffi- 
cients as a result of utilizing only 20% of the 
data. In accordance with this consideration, 
only about half the coefficients are significant.


Test and Session Differences 


In order to obtain further information on 
whether repeated testing results in set and 
practice effects, reliability coefficients were 
computed on the data from sessions I and X 
and from sessions IX and X. In the former 
case, significant reliability coefficients at the 
.05 level were obtained for only two scores, 
m and C. In the case of sessions IX and 
X, ten coefficients were significant at the .05 
level as compared to twelve which were found 
for sessions I and II. Apparently, individuals 
performed in a consistent manner to the last 
few as well as the first few tests, but were no 
longer doing the same thing at the end as they 
were at the beginning. This may mean that 
the scores investigated are significantly stable 
only over very short periods of time, or, more 
likely, that the experience of taking some ink- 
blot tests alters the reaction to further tests. 

Differences between sessions can be asso- 
ciated either with differences in the tests or in 
sequential position. Significant test-session dif- 
ferences were found for all seven location 
scores other than the space response. For no 
other score was significance approached, the 
majority of F values falling short of one. In 
order to determine whether the significance 
of the location scores was associated with or- 
derly sequential changes, the mean number 
of responses for each of the significant scores 
was plotted as a function of the number of 
the session. No tendency was found for any 
of the scores to consistently rise or fall. It
thus appears that the differences associated 
with the testing sessions were probably a 
function of differences in the inkblot sets. It 
is interesting to consider, if this is the case, 
that the composition of the blots determined 
which areas were responded to but did not 
influence the determinants or content. The 
finding that there was no tendency for any of 
the scores to increase or decrease does not, of 
course, contradict the conclusion that indi- 
viduals developed sets, but indicates that the 
sets were not uniform. 


Discussion 


The finding that responses to the inkblots 
were primarily a function of the individuals 
responding to them rather than of the charac- 
teristics of the blots supports the basic theory 
underlying the use of inkblots as a projective 
technique. Every score investigated, whether 
a standard Rorschach score or one developed 
for this study, measured individual differences 
to a significant degree. However, the degree 
of reliability found was not encouraging. 
Moreover, there was reason to believe that 
adding more cards or requiring more responses 
would not have raised the reliability of any of 
the measures to an acceptable degree unless 
the test was made so lengthy as to be im- 
practical. Possibly this indicates that the ink- 
blot approach to measuring personality can 
never be sufficiently objectified and must re- 
main in the nature of an art. However, such 
a conclusion would be premature for several 
reasons. 

For one, different scores and combinations 
of scores might have yielded different results. 
It may be possible to combine several scores 
relatively low in reliability into composite in- 
dices of higher reliability. In this connection 
it was found that the weighted sum of the 
color scores had a higher reliability than any 
of the individual color scores, and the same 
held for a combination movement score. However,
the M:(ΣC + M) ratio was not more
reliable than its components. The Rorschach 
is a test in which patterns are considered more 
important than individual scores. This is a 
reasonable claim, but if it is to be more than 
an escape behind the skirts of a mistaken 
notion of Gestalt psychology, it is necessary 
to define the patterns. Once this is done their
reliability can be assessed. 

Secondly, the low reliabilities may have 
been a function of the relatively homogeneous 
nature of the sample. The degree of sensitivity 
required of a measure is obviously a function 
of the magnitude of the deviations it must 
discriminate. Responses to inkblots could be 
of sufficient reliability to diagnose extremes 
of behavior without being adequate for dis- 
criminating among “normals.” In this respect, 
it is necessary to investigate reliability coeffi- 
cients in different types of populations. 

Thirdly, the fact that the present set of 
inkblots failed to yield adequate reliability 
coefficients does not mean that other inkblot 
tests would not give better results. Unfortu- 
nately, there is no satisfactory reliability 
study of the Rorschach. The split-half tech- 
nique used by a few investigators is not ap- 
propriate (1). In a test-retest study by Eich- 
ler (2) on the Rorschach and Behn-Rorschach 
blots, the total number of responses was not controlled, and
may be presumed to have inflated the coeffi- 
cients. Nevertheless, he concluded that the 
reliability was inadequate for individual pre- 
diction. There is a need for experimental 
exploration to determine how inkblot tests 
should be constructed and administered in 
order to maximize their efficiency as measures 
of individual differences. It may be that the 
Rorschach Test will not be the final develop- 
ment in the use of inkblots as a measure of 
personality. 

A final possibility is that the personality 
variables presumably measured by inkblot re- 
sponses are themselves not highly stable. How- 
ever, if it can be demonstrated that increas- 
ing the length of the test, or otherwise modi- 
fying it, results in an increase in reliability, 
it would have to be conceded that it is the 
instrument and not the nature of personality 
which is determining the present limits. In 
this connection it would be desirable to in- 
vestigate reliability as a function of time be- 
tween tests to determine to what extent the 
coefficients are affected by changes in time as 
opposed to error of measurement. The latter 
would obviously be an intrinsic limitation of 
the test. As the split-half technique is inap- 
plicable for the standard Rorschach set, it 
would be necessary to use a parallel set or to 


construct new inkblot sets, as in the present 
study. 


Summary 


Personality, whatever else it is, involves 
the concept of stable individual differences 
through time. The most fundamental require- 
ment of a test of personality, therefore, is that 
it demonstrate that it can measure such dif- 
ferences. In order to evaluate responses to ink- 
blots in this respect, 100 specially constructed 
inkblots were randomly assigned to ten sets 
of ten each. This was followed by intensive 
group testing of 16 individuals ten times over 
a five week period. In order to hold total 
number of responses constant, three responses 
per card were required. Several Rorschach 
scores plus some new scores were investigated. 
It was found that all scores measured indi-
vidual differences significantly beyond chance,
which was interpreted as supporting the basic
theory underlying the use of inkblots as a
projective technique. However, reliability co-
efficients were below acceptable standards for
individual prediction. This was true when the
coefficients were derived only from the first
two tests as well as from all ten. There is no
evidence that Rorschach test-retest reliability
coefficients would be more favorable if total
number of responses and memory were con-
trolled. The implications of the present find-
ings and the need for further investigation
of reliability of responses to inkblots were dis- 
cussed. 


Received October 2, 1956.


Anxiety questionnaires tend to show low to
moderate correlations with ratings of observ- 
able behavior, and moderate to high correla- 
tions with other subjective measures or ques- 
tionnaires. The concept of anxiety, however, 
is so broad, including diverse psychological 
and physiological aspects, that the prediction 
of striking interrelationships seems unwar- 
ranted. 

The present study undertook the analysis 
of the interrelationships among ten variables 
associated with the concept of anxiety. The 
data were collected during an experimental 
evaluation of a tranquilizing drug. Sixty-six 
hospitalized male patients were observed by 
ward personnel during a ten-day premedica- 
tion period. Since a schizophrenic population 
was employed, three measures of severity of 
illness (degree of contact, consistency of re- 
sponse, and type of ward) were included. 
From a psychiatric interview, ratings of the 
S’s report of physical symptoms, observable 
signs of anxiety, and degree of contact were 
obtained. Ward personnel rated patients for
the extent of disturbed behavior. A count of 
sleep disturbances, blood pressure, and pulse 
readings were made. An anxiety questionnaire 
yielded an anxiety score and the consistency 
of response score. Patients were also asked 
the direct question: “Do you consider your- 
self to be a tense or anxious person?” 


1 Based on a paper read at the meeting of the 
American Psychological Association, Chicago, 1956. 

2 An extended report of this study may be ob-
tained without charge from Harold Wilensky, VA
Hospital, Montrose, N. Y., or for a fee from the 
American Documentation Institute. Order Document 
No. 5181, remitting $1.25 for microfilm or $1.25 for 
photocopies. 




Tetrachoric correlations were computed 
among the variables and the resulting matrix 
was factored by Thurstone’s complete cen- 
troid method. Two factors were extracted. 
The axes were rotated to oblique simple 
structure. The correlation between factors is
-.34.
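The report does not say how the tetrachoric coefficients were obtained. As a minimal sketch, the cosine-pi approximation is one common hand method for a fourfold table; whether it was the method used here is an assumption, and the cell counts in the usage note are hypothetical.

```python
import math


def tetrachoric_cos_pi(a: int, b: int, c: int, d: int) -> float:
    """Cosine-pi approximation to the tetrachoric correlation.

    a and d are the concordant cells (+/+ and -/-) of a 2 x 2 table
    of dichotomized variables; b and c are the discordant cells.
    This is an approximation, not the exact bivariate-normal solution.
    """
    if a * d == 0 and b * c == 0:
        raise ValueError("degenerate table: coefficient is undefined")
    if b * c == 0:
        return 1.0   # no discordant pairs: approximation saturates at +1
    if a * d == 0:
        return -1.0  # no concordant pairs: approximation saturates at -1
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(odds_ratio)))


# Hypothetical fourfold table for 66 patients (not data from the study).
r_example = tetrachoric_cos_pi(25, 8, 10, 23)
```

A table with equal cells gives a coefficient of zero, and the sign follows whichever diagonal dominates.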

Factor A loads highly on subjective reports of
anxiety—the anxiety scale, the admission of 
anxiety in response to the direct question, and 
the psychiatrist’s ratings of the subject’s re- 
port of psychological and physical symptoms. 
It tentatively may be identified as experienced 
anxiety. Patients who recognize and readily 
admit to feelings of anxiety do so on paper 
and pencil tasks and in interview situations. 
A moderate loading of the ward behavior 
variable and the low but significant loading 
in the sleep disturbance count suggest that 
the experienced anxiety is in part expressed 
in daily ward living. 

Factor B appears to be defined by the con- 
tact with reality variables and also higher 
blood pressure. The moderate loadings of the 
blood pressure readings on this factor are in 
accord with previous studies. Schizophrenics 
in remission tend to have higher blood pres-
sure than the more disturbed patients. The
psychiatric rating of observable signs of anx- 
iety loads negatively on Factor B. Many 
schizophrenic patients in poorer contact mani- 
fest behavior such as grimaces and manner- 
isms which tend to be rated as tension, al- 
though they do not admit verbally to feelings 
of anxiety. 


Brief Report. 
Received February 25, 1957.


Differences in Perceptual and Cognitive Behavior
as a Function of Experience Type¹


James Bieri and Susan Messerley 


Harvard University 


The proposition that differences in experi- 
ence type on the Rorschach reflect certain 
trends in personality functioning has been a 
common assumption in clinical practice. In a 
previous paper (3), it was pointed out that 
one way of conceptualizing the experience 
type is in terms of the person’s use of in- 
ternal and external stimulus factors. The in- 
troversive person utilizes internal stimulus 
factors to a greater degree in that he invokes 
aspects of the stimulus situation more proxi- 
mal to himself, while the extratensive person 
utilizes external stimulus factors in that he 
responds more to the physical, “out-there” 
qualities of the blot. The present research is 
designed to establish empirically differences 
in performance on a perceptual and cognitive 
task as a function of the experience type of 
the person. The perceptual problem given to 
our subjects (Ss) was the Gottschaldt Em- 
bedded Figures Test (EFT), as modified by 
Witkin (7). This test requires S to perceive
a simple geometric figure embedded in a 
larger, more complex figure. In their research 
on field dependence and independence, Wit- 
kin and his associates have studied certain 
empirical relationships between the EFT and 
the Rorschach (8). They found that Ss with 
longer solution times on the EFT (field de- 
pendent) had lower “coping” scores on the 
Rorschach, as defined for example by more 
CF and C responses than FC responses. Quick 
perception of the simple figure (field inde-
pendence), on the other hand, was associated with
giving more human movement responses and 
predominantly FC responses. 

In that portion of the current research con- 


1 This study was facilitated by a grant from the 
Laboratory of Social Relations, Harvard University. 


cerning the relationship of the Rorschach ex- 
perience type and the EFT, we assumed that 
introversive Ss on the Rorschach are express- 
ing more conceptual, abstract behavior. In a 
sense, they are not as dependent as extraten-
sive Ss on the available external stimulus fac-
tors, but utilize a more personalized and ab-
stract problem-solving attitude. Therefore, on
the EFT, we would expect this conceptual ap-
proach of the introversives to be of benefit in
finding the simple figure, inasmuch as this
type of approach has been considered to fa-
cilitate EFT solutions (7). The extratensives,
by analogous reasoning, we would expect to
be more dependent on the external stimulus
factors. To the extent the extratensive Ss de-
pend upon the stimulus field of the Gottschaldt
card, and do not invoke the more conceptual
approach of the introversive Ss, we would ex-
pect the former Ss to do more poorly on the
EFT.

Prediction 1. Ss with an introversive experi-
ence type have shorter solution times on the
EFT than do Ss with an extratensive experi-
ence type.

The cognitive task given to our Ss was a 
modification of the Role Construct Repertory 
Test developed by Kelly (6). In our use of 
this test, we were primarily interested in a 
measure of cognitive complexity. As reported 
elsewhere (4), cognitive complexity is defined 
as the ability to develop alternative interper- 
sonal perceptions from among a group of per- 
sons known to S. The S is asked to make cer- 
tain sortings (described below) using people 
as the objects to be sorted. These sortings are 
assumed to represent available perceptions S 
has of significant others. The greater the num-
ber of perceptions elicited from S on the sorts,
the more cognitively complex is he. The hu-
man movement response presumably repre- 
sents an ability to go beyond what is im- 
mediately available in the stimulus. We should 
expect, therefore, that cognitive complexity, 
as a measure of response differentiation, 
should relate positively to the tendency to 
have an introversive experience type. This 
tendency should be further reinforced since 
both the cognitive complexity and human 
movement measures involve people as con- 
tent. 

Prediction 2. Ss with an introversive ex- 
perience type have higher cognitive complex- 
ity scores than do Ss with an extratensive ex- 
perience type. 

Method 
Experimental Groups 


Due to the empirical nature of the research, 
it was decided to employ the method of cross 
validation in the experimental procedure. Ac- 
cordingly, after the initial group of Ss was 
run (hereafter referred to as Group 1), a 
second group of Ss (Group 2) was given the 
identical experimental battery. The Group 1 
Ss consisted of 39 female undergraduates from 
Radcliffe College. Group 2 Ss were 23 female 
summer school students who were primarily 
college undergraduates from schools in the 
eastern part of the country. All Ss were paid 
to serve in the research. 


Inkblot Test 


In place of the standard Rorschach test 
procedure, a special modification for obtain- 
ing inkblot reactions was used. This modifica- 
tion consists of using 16 Rorschach D’s which 
were selected on the basis of their judged 
ability to provide both internal and external 
stimulus factors. Such a selection facilitates 
obtaining an adequate number of human 
movement and color responses for measure- 
ment purposes. The number of blots used was 
increased to 16 from the previously used 10 
(3) in order to provide more variable stimu- 
lus material and to reduce the number of 
repetitions. A template placed over the entire 
card exposes only the desired blot portion. 
The S is presented with each blot following 
the usual Rorschach instructions. Only one 
response per blot is elicited. After S has gone 


through the 16 blots, he is told he will see 
them again and is to tell the examiner what 
else it might be. This second trial is followed 
by the usual inquiry procedure. In this way, 
a total of 32 responses is obtained from each 
S. The following are Beck’s (2) notations of 
the specific blot areas which were used, in 
order of administration: 


Card I: entire center “figure” area (D 4). 

Card II: upper left red (D 2). 

Card IV: lower center area (D 1). 

Card III: upper left red inverted (D 2). 

Card V: entire right half without “antenna” (D 4). 

Card VIII: bottom center pink and orange area 
inverted (D 2). 

Card VI: entire left half of card without top pro- 
jection, rotated 90 degrees to right (D 4). 

Card IX: left green figure, card rotated 90 degrees 
to right (D 4). 

Card VII: entire right half of card (D 9). 

Card X: upper left blue area (D 1). 

Card III: left human figure (D 1). 

Card IX: top right orange area (D 3). 

Card IV: lower projection of “heel” and “toe” of 
boot, rotated 90 degrees to the right (D 2). 

Card X: inner blue area (D 6). 

Card V: entire middle, including D 2 and D 3 
(D 7). 

Card III: middle red (D 3). 


Embedded Figures Test 


Because of time limitations in the experi- 
mental procedure, only eight of the 24 em- 
bedded figures used by Witkin were used in 
this study. While some unreliability is intro- 
duced because of the fewer figures, care was 
taken to select those drawings which would 
be representative of the various difficulty lev- 
els reported by Witkin (7). The following are 
Witkin’s notations for the figures used, in or- 
der of administration: F-1; D-2; A-5; G-2; 
E-5; B-1; C-1; and A-2. The EFT was ad- 
ministered with the same instructions used by 
Witkin (7). Each S’s score on the EFT was 
the sum of the times taken to find the simple 
figures in all eight complex figures. Ss were 
given two-minute time limits for the first 
seven figures, and a four-minute limit on the 
final figure. If S had not located the simple
figure when the time limit had elapsed, he 
was given a score of either 120 or 240 seconds. 
The time allowances were ample for the ma- 
jority of Ss to find the embedded figures. 
Group 1 Ss had a mean solution time of 
315.78 seconds, with a range of 80 to 805
seconds, while Group 2 Ss had a mean solu-
tion time of 357.04 seconds, with a range of
99 to 837 seconds. 
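The scoring rule just described can be sketched as follows. This is an illustration only: the function name, the use of `None` for a figure not found, and the example times are assumptions, while the time limits are those given in the text.

```python
from typing import Optional, Sequence

# Time limits in seconds: two minutes for each of the first seven
# figures and four minutes for the eighth, as stated in the text.
TIME_LIMITS = [120] * 7 + [240]


def eft_score(times: Sequence[Optional[float]]) -> float:
    """Sum the solution times over the eight figures.

    A value of None (figure not found) or a time beyond the limit is
    scored as the full time limit for that figure.
    """
    if len(times) != len(TIME_LIMITS):
        raise ValueError("expected one time per figure (8 figures)")
    total = 0.0
    for t, limit in zip(times, TIME_LIMITS):
        total += limit if t is None or t > limit else t
    return total


# Hypothetical protocol: the third figure was not found in time.
example_score = eft_score([12, 30, None, 45.5, 20, 8, 60, 200])
```

Under this rule the largest possible score is 7 x 120 + 240 = 1080 seconds, so the reported ranges (80 to 805 and 99 to 837) lie well inside the scale.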


Cognitive Complexity 


On this sorting task, S is asked to write the 
names of seven persons known personally to 
him who come closest to matching seven role 
descriptions. The following role descriptions 
were used for the seven persons: 


1. Yourself.
2. Your mother.
3. Your father.
4. Your sister closest to you in age, or the person
most like a sister.
5. Your current boy-friend, or a boy you know
well.
6. Your best friend (of your sex).
7. Your ex-boy-friend, or a boy you know well.


This list was designed to sample both 
family members and age-mates of the same 
and opposite sex. The actual sorting consisted 
of placing three of the names at a time in 
front of S and asking her to consider how two 
of the three were alike in some important per- 
sonal characteristic, and different from the 
third in this respect. The S was then asked to 
name the similarity and its opposite, even 
though the opposite did not apply directly to 
the third person. For example, the first sort 
given Ss contained the self, mother and father. 
One S sorted herself and her father together 
as being “selfish,” the opposite of which she 
stated to be “generous.” Thus the mother is 
perceived as more generous than the father 
or S. Each S is given 25 sorts, prearranged so 
that no three persons appear together more 
than once. The score for cognitive complexity 
is obtained by counting the number of dif- 
ferent verbal dimensions given by S on the 
25 sorts. A repetition is counted if either or 
both ends of a dimension are identically re- 
peated on a subsequent sort. Thus, S may re- 
ceive a complexity score ranging from 1 to 25. 
Actually, the scores obtained for Group 1 
ranged from 6 to 25, with a mean of 18.46, 
while scores for Group 2 ranged from 11 to 
25, with a mean of 19.48. 
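A minimal sketch of the complexity count described above. Two points are assumptions rather than the authors' stated procedure: labels are compared after trimming and lower-casing, and the labels of a repeated dimension are also remembered for later comparisons. The example sorts other than "selfish/generous" are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Tuple

# Seven role persons yield 35 possible triads; the 25 sorts are drawn
# from these so that no triad is presented twice.
ALL_TRIADS = list(combinations(range(1, 8), 3))


def cognitive_complexity(sorts: Iterable[Tuple[str, str]]) -> int:
    """Count sorts that introduce a new verbal dimension.

    Each sort is a (similarity, opposite) pair. A sort counts as a
    repetition if either end matches a label used on an earlier sort.
    """
    seen = set()
    score = 0
    for similarity, opposite in sorts:
        poles = {similarity.strip().lower(), opposite.strip().lower()}
        if not (poles & seen):
            score += 1  # neither end has appeared before
        seen |= poles   # assumption: repeated sorts also register labels
    return score


example_sorts = [
    ("selfish", "generous"),
    ("shy", "outgoing"),
    ("Selfish", "giving"),   # repeats "selfish": not counted
    ("calm", "tense"),
]
example_complexity = cognitive_complexity(example_sorts)
```

With 25 sorts the count runs from 1 (every sort repeats the first dimension) to 25 (no repetitions), which matches the range stated in the text.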


Results and Discussion 


The inkblot responses for each S were 
scored without knowledge of S’s EFT or cog- 


nitive complexity scores. A satisfactory inter- 
scorer reliability for the inkblot reactions had 
been obtained, as reported in a previous study 
(3). To obtain the experience type of each S, 
the total number of human movement (M) re-
sponses and the Sum C score were calculated.
The latter score was derived by weighting the 
various color responses (FC, CF, and C) in 
the usual manner. It was decided to place 
Ss into one of two experience types, either 
introversive (M > Sum C) or extratensive 
(Sum C > M). This was done to enable the 
use of a larger number of Ss in each group 
than would be possible if other experience- 
type groupings were also used (such as co- 
arctated and ambiequal). In order to place 
each S in either the introversive or extraten-
sive group, the original M and Sum C scores
were converted to standard scores. Such a 
procedure, as Barron (1) and others have 
pointed out, eliminates the otherwise tenuous 
assumption that the blots have equal stimulus 
value in evoking movement and color re- 
sponses. Separate standard score distributions 
were calculated for Group 1 and Group 2. If 
S had a larger standard score for M than for 
Sum C, she was placed in the introversive 
category; if her Sum C standard score was 
larger than her standard score for M, she was 
placed in the extratensive category. One S in 
Group 1 had identical standard scores for M 
and Sum C, and was therefore excluded from 
the analyses for that group. In this way, for 
Group 1, 18 Ss were placed in the M > Sum 
C category, and 20 Ss fell into the Sum C >
M category. Similarly, in Group 2, 11 Ss were
in the M > Sum C category, and 12 Ss were 
in the Sum C > M category. 
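The classification by standard scores can be sketched as below. The function names and the example scores are hypothetical. The population standard deviation is used for convenience; because both scores are standardized within the same group, choosing the sample standard deviation instead would rescale both standard scores by the same factor and leave the comparison unchanged.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def standard_scores(values: Sequence[float]) -> List[float]:
    """Convert raw scores to standard (z) scores within one group."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("scores do not vary; z scores are undefined")
    return [(v - m) / s for v in values]


def experience_types(m_scores: Sequence[float],
                     sum_c_scores: Sequence[float]) -> List[str]:
    """Label each S by comparing her z score on M with that on Sum C."""
    labels = []
    for zm, zc in zip(standard_scores(m_scores),
                      standard_scores(sum_c_scores)):
        if math.isclose(zm, zc, abs_tol=1e-12):
            labels.append("excluded")      # identical standard scores
        elif zm > zc:
            labels.append("introversive")  # M > Sum C
        else:
            labels.append("extratensive")  # Sum C > M
    return labels


# Hypothetical raw scores for four Ss (not data from the study).
example_labels = experience_types([9, 3, 6, 2], [1.0, 4.0, 2.5, 3.5])
```

Because the standard scores depend on the group's mean and spread, pooling groups can move an S across the boundary, as happened for a few Ss when Groups 1 and 2 were combined.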

Table 1 presents the mean number of re-
sponses in the various determinant classes
given by the Ss in the M > Sum C and Sum
C > M categories of Groups 1 and 2. In ad-
dition, for purposes of further evaluation, Ss
from both groups were placed in a total dis-
tribution, forming Group 1 + 2. This was
done after it had been established that there
were no significant over-all differences between
Group 1 and Group 2 on the Rorschach, the
EFT, and the cognitive complexity measures.
New standard scores were derived to place Ss
in the appropriate experience type for Group 1
+ 2 in Table 1. Actually, all Ss remained in
their original experience-type category except
for the following changes: Two Ss who were
originally extratensive in Group 1 became in-
troversive in Group 1 + 2; two Ss who were
originally introversive in Group 2 became ex-
tratensive in Group 1 + 2; and the one S
who had been ambiequal in Group 1 became
extratensive in Group 1 + 2.

For purposes of presentation, the few pure
color (C) responses given among all the Ss
were included in Table 1 in the CF determi-
nant listing. Similarly, all the various Klopfer
shading scores (FK, KF, k, Fc, etc.) were
included in the Sh category. It can be ob-
served from Table 1 that the introversive and
extratensive Ss in Groups 1, 2, and 1 + 2
differ from each other only in terms of the M,
CF, and Sum C scores. The sole exception is
the significant difference in Group 2 between
the experience types on the number of Sh re-
sponses given.


Table 1
Mean Rorschach Scores of the M>Sum C and Sum C>M Categories for the Experimental Groups

              Group 1              Group 2              Group 1+2
          M>Sum C  Sum C>M     M>Sum C  Sum C>M     M>Sum C  Sum C>M
Score     (N=18)   (N=20)      (N=11)   (N=12)      (N=29)   (N=33)
M          9.3*     3.8         8.4*     3.8         9.3**    3.5
FM         3.9      3.0         3.6      4.0         3.7      3.4
m          1.1      0.8         0.6      0.5         0.9      0.7
FP        15.1     16.1        16.4     14.3        15.1     16.6
Sh         2.5      2.8         0.6      3.6*        1.9      2.9
FC         1.3      2.4         1.4      2.6         1.5      2.3
CF         0.4      2.4*        0.9      3.0*        0.7      [illegible]
Sum C      1.1      3.8*        1.6      4.5*        1.5      3.8*

* Higher than mean score of other experience type category on this variable at or below .05 level of significance (two-tails).
** Higher than mean value of this score of other experience type category in this group at .01 level of significance (two-tails).

Table 2 presents the experimental findings
for Groups 1, 2, and 1 + 2 relative to the two
predictions presented earlier in the paper.
White’s ranking method (5) was used to test
the significance of the differences between
means. Inspection of Table 2 indicates that
the M > Sum C Ss in Group 1 had signifi-
cantly longer solution times on the EFT than
did the Sum C > M Ss, and that this differ-
ence persisted in the cross validation involv-
ing Group 2. When the Ss are combined into
Group 1 + 2, we find that the difference be-
tween the two experience types on EFT per-
formance becomes even more significant (p
< .01). These findings are directly opposite
to those anticipated in Prediction 1 of the
study.


Table 2
Mean Scores of the M>Sum C and Sum C>M Categories on EFT and Cognitive Complexity
for the Experimental Groups

                       EFT                Cognitive complexity
               M>Sum C    Sum C>M        M>Sum C    Sum C>M
Group 1        363.70*    272.20         16.78      19.75*
               (N=16)     (N=20)         (N=18)     (N=20)
Group 2        430.82*    289.92         18.36      20.50*
               (N=11)     (N=12)         (N=11)     (N=12)
Group 1+2      384.81**   286.88         17.62      19.94*
               (N=27)     (N=33)         (N=29)     (N=33)

* Higher than mean score of other experience type category on this variable at or below .05 level of significance (two-tails).
** Higher than mean score of other experience type category on this variable at or below .01 level of significance (two-tails).

The findings relative to Prediction 2 are 
also to be found in Table 2. Here again, the 
results are in the opposite direction from the 
original prediction. In both Groups 1 and 2, 
Ss with the Sum C > M experience type had 
significantly higher cognitive complexity scores 
than did those Ss with the M > Sum C ex- 
perience type. Again, this difference exists 
when the Ss are combined into Group 1 + 2. 

It is apparent that the findings ran con- 
trary to our expectations in both experimen- 
tal groups, and that the findings on the EFT 
run counter in some degree to the previous 
work of Witkin (8). We wish to comment 
briefly upon one factor which appears to have 
contributed to our unexpected results, that is, 
our use of female Ss. It is likely that in mak- 
ing our original predictions we were not suffi- 
ciently aware of the possible importance of 
sex differences in performance on the kinds of 
perceptual and cognitive tasks we employed. 
Witkin has discussed this problem at some 
length (7, 8), and generally finds women less 
effective on the EFT and more variable in 
performance. It is interesting to note that the 
EFT performance of women is reported to 
correlate less highly with Rorschach indices 
than in the case of men. In a previous study 
(4), one of the present writers found that for 
males, M production correlated positively 
with cognitive complexity scores. Work cur- 
rently in progress is designed to investigate 
these problems in more detail. 


Summary 


This study was designed to investigate dif- 
ferences in performance on a perceptual and a 
cognitive task as a function of experience 
type. It was predicted that introversive Ss 
would be able to perceive the simple figure on 
the embedded figures test more quickly than
extratensive Ss. Further, introversive Ss were
predicted to produce a higher score on a cog- 
nitive complexity test which involved an ob- 
ject-sorting task using people as the objects 
to be sorted. The inkblot method used was a 
modification of the regular Rorschach blots in 
which only Rorschach Ds were used. M and 
Sum C scores were converted into standard 
scores in order to place each S in either the 
introversive or extratensive experience-type 
category. Two groups of women college stu- 
dents were used as experimental Ss. The re- 
sults from both groups were in the direction 
opposite from that originally predicted. Ex- 
tratensive Ss perceived the embedded figures 
significantly faster than introversive Ss, and 
the former Ss had significantly higher cog-
nitive complexity scores in their perception of
people than did the latter group of Ss. The 
relationship of this study with previous re- 
search is discussed, and the importance of the 
role of sex differences in perceptual perform-
ance is suggested.


Received August 27, 1956.


An Item Analysis of the Coloured Progressive
Matrices (1947) 


Thomas E. Jordan and Carson M. Bennett 
Ball State Teachers College, Muncie, Indiana 


The Coloured Progressive Matrices (2) is 
a series of 36 designs, each of which is in- 
complete. The subject indicates which of a 
series of possible inserts he believes will com-
plete the design. The scale is intended to
measure Spearman’s g factor and emphasizes
the eduction of relationships and correlates.
The significance of this interesting instru-
ment is still in doubt since no item analysis
has been made. As a result of a continuing
interest on the part of the senior author in
the Coloured Progressive Matrices as a test
of intellectual ability for multihandicapped
children, the authors have undertaken to cor-
rect this deficiency.


Table 1
Index of Discrimination and Level of Difficulty of Items

        Level     Index               Level     Index
        of        of                  of        of
Item    diff.     discrim.    Item    diff.     discrim.
A1      96%       .31         Ab7     24%       .49
A2      91        .22         Ab8     17        .24
A3      90        .20         Ab9     21        .24
A4      88        .37         Ab10    23        .29
A5      74        .03         Ab11    22        .28
A6      76        .50         Ab12    10        .02
A7      33        .37         B1      92        .16
A8      49        .28         B2      71        .59
A9      36        .52         B3      42        .70
A10     29        .46         B4      50        .54
A11     10        .06         B5      22        .46
A12     24        .04         B6      28        .35
Ab1     88        .31         B7      17        .15
Ab2     85        .41         B8       4        .13
Ab3     70        .45         B9      11       -.02
Ab4     26        .43         B10     16        .13
Ab5     28        .39         B11      9        .05
Ab6     24        .53         B12      5        .02

The Coloured Progressive Matrices was ad- 
ministered to two hundred children who were 
entering first grade and who had no known 
sensory-motor handicaps. The indices of dis- 
crimination were computed using the conven- 
tional upper and lower 27 per cent technique. 
The item difficulties were determined by an 
item count of correct responses. 
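The two item statistics can be sketched as follows. The report names the upper and lower 27 per cent technique but does not give the formula for the index; the difference between the proportions passing in the two extreme groups is assumed here, and the small data set is hypothetical.

```python
from typing import List, Sequence, Tuple


def item_analysis(responses: Sequence[Sequence[int]]) -> List[Tuple[float, float]]:
    """Return (level of difficulty, index of discrimination) per item.

    responses[i][j] is 1 if child i passed item j, else 0.
    Difficulty is the per cent of all children passing the item.
    Discrimination is assumed to be p(upper 27%) - p(lower 27%),
    with the extreme groups chosen by total score.
    """
    n = len(responses)
    k = max(1, round(0.27 * n))                  # size of each extreme group
    ranked = sorted(responses, key=sum, reverse=True)
    upper, lower = ranked[:k], ranked[-k:]
    results = []
    for j in range(len(responses[0])):
        difficulty = 100.0 * sum(r[j] for r in responses) / n
        p_upper = sum(r[j] for r in upper) / k
        p_lower = sum(r[j] for r in lower) / k
        results.append((difficulty, p_upper - p_lower))
    return results


# Hypothetical responses of ten children to three items.
example_data = [
    [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 0, 1],
    [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0],
]
example_stats = item_analysis(example_data)
```

For the 200 children tested, each extreme group under this rule would contain 54 children.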

The data in Table 1 can be interpreted by 
the application of criteria concerning the qual- 
ity of test items. One such set of criteria has 
been supplied by Ebel (1, pp. 143-152), who 
suggested that an index of discrimination of 
.20 and above is satisfactory and that the 
range 40 to 70 per cent be used as a criterion 
for level of difficulty. When these criteria are 
applied, 25 of the 36 items are satisfactorily 
discriminative, but only 4 fall in the suggested 
difficulty range. Twenty-two items appear to 
be too difficult for the age group studied. The 
data suggest that the test is of less value for 
lower age groups. 
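The counts in the preceding paragraph can be checked against Table 1. In the sketch below the item values are transcribed from Table 1 with decimal points restored where the scan dropped them; that reading is an assumption, although it reproduces the three counts given in the text.

```python
# (level of difficulty in per cent passing, index of discrimination)
ITEMS = {
    "A1": (96, .31), "A2": (91, .22), "A3": (90, .20), "A4": (88, .37),
    "A5": (74, .03), "A6": (76, .50), "A7": (33, .37), "A8": (49, .28),
    "A9": (36, .52), "A10": (29, .46), "A11": (10, .06), "A12": (24, .04),
    "Ab1": (88, .31), "Ab2": (85, .41), "Ab3": (70, .45), "Ab4": (26, .43),
    "Ab5": (28, .39), "Ab6": (24, .53), "Ab7": (24, .49), "Ab8": (17, .24),
    "Ab9": (21, .24), "Ab10": (23, .29), "Ab11": (22, .28), "Ab12": (10, .02),
    "B1": (92, .16), "B2": (71, .59), "B3": (42, .70), "B4": (50, .54),
    "B5": (22, .46), "B6": (28, .35), "B7": (17, .15), "B8": (4, .13),
    "B9": (11, -.02), "B10": (16, .13), "B11": (9, .05), "B12": (5, .02),
}

# Ebel's criteria as cited in the text.
discriminating = [k for k, (_, r) in ITEMS.items() if r >= 0.20]
in_range = [k for k, (d, _) in ITEMS.items() if 40 <= d <= 70]
too_difficult = [k for k, (d, _) in ITEMS.items() if d < 40]

print(len(discriminating), len(in_range), len(too_difficult))  # 25 4 22
```

The four items inside the suggested difficulty range are A8, Ab3, B3, and B4; the remaining ten items are passed by more than 70 per cent of the children.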


Received January 21, 1957. 
Early Publication.


Diagnostic Prediction From Emphasis on the Eye
and the Ear in Human Figure Drawings 


Ronald I. Ribler¹


Michigan State University 


In 1949, Karen Machover (1) offered sug- 
gestions on the significance of variability in 
human figure drawings pertinent to person- 
ality. She reported finding that certain as- 
pects of the personality of the individual were 
frequently shown in features of his drawings 
and that specific personality types would be 
more prone to present given traits in their
drawings which would enable the clinician to 
make diagnostic differentiation. Among these 
traits were emphasis of eyes and ears, which 
Machover reported, would be found most fre- 
quently in paranoid schizophrenics (1, pp. 
48-50). The purpose of this study was to 
test the following hypothesis, which is implicit
in Machover’s monograph: para-
noid schizophrenics, when asked to draw a
human figure, produce drawings with a 
greater proportion of eye and/or ear empha- 
sis than other diagnostic groups. 

Since Machover (1) did not define “em-
phasis,” the following definitions were used for
the purposes of this study. Eye emphasis is
defined as any one or more of the following:
(a) unusual placement, (b) disproportion of eye with relation
to total figure, (c) differential line quality 
(of eye in terms of shading, etc., in contrast 
to the rest of the figure), (d) excessive detail 
of eye parts and/or eyebrows, (e) marked 
staring quality. Ear emphasis is defined as 
any one or more of the following: (a) un-
usual placement, (b) disproportion of ear
with relation to total figure, (c) differential 
line quality (of ear in terms of pressure, 
shading, etc., in contrast to the rest of the 


1 This study was made possible by the cooperation 
of the staff of the Veterans’ Administration Hos- 
pital, Battle Creek, Michigan, and G. M. Gilbert of 
Michigan State University, to whom gratitude is 
here expressed. 




figure), (d) excessive detail of ear parts, (e)
marked “listening” quality, (f) ear adorn- 
ments (earrings, etc.). 


Subjects and Procedure 


Subjects. The sample consisted of 120 
males, patients and attendants in a Veterans’ 
Administration neuropsychiatric hospital, di- 
vided into four groups: Group I, 30 patients 
diagnosed paranoid schizophrenia (mean age 
30, mean IQ 100.7). Group II, 41 patients 
diagnosed unclassified schizophrenia (mean 
age 28, mean IQ 100.4). Group III, 16 pa-
tients diagnosed anxiety neurosis (mean age
35, mean IQ 107.4). Group IV, 33 hospital 
attendants (“normal” controls, mean age 29,
mean IQ 85.3). Henmon-Nelson IQ was used
on the attendant group. Groups I, II, and III
were chosen as representing a typical sample 
of diagnostic cases seen in neuropsychiatric 
hospitals. In these three groups the Draw-A- 
Person Test was administered as part of the 
regular test battery under standard testing 
conditions (1, pp. 28-29). Group IV was 
tested in a classroom situation, also under 
standard testing procedures (1, p. 105). Two 
drawings were obtained from each subject in 
all of the groups. 

Judgment of drawings. After the drawings 
were collected, the identifying marks were ob- 
literated. The drawings were then numbered, 
randomized, and listed by number on the 
judges’ rating sheets. The age and race of 
the subjects were also listed on the judges’ 
sheets. The procedure for judging was stand- 
ard, the drawings being presented to four 
judges to determine the presence or absence 
of the variables of eye and/or ear emphasis.


Table 1
Reliability of Judges as Determined by Hoyt’s Analysis of Variance Technique

Category         Reliability    df
Eye-Drawing 1    .65*           357
Ear-Drawing 1    .67*           357
Eye-Drawing 2    .73*           [illegible]
Ear-Drawing 2    [illegible]    [illegible]

Note. Mean reliability is .70.
* Significant beyond .01.

Three staff psychologists and one advanced 
graduate student in clinical psychology were 
asked to rate emphasis in the case of each 
drawing. Each judge was supplied with in- 
structions as follows: 


These drawings were gathered from four diagnostic 
groups; “normals” (attendants), anxiety reactions, 
paranoid schizophrenics, and unclassified schizophren- 
ics. The drawings are numbered and arranged in 
serial order from 1 to 120, with the diagnostic groups 
randomly represented. The enclosed sheet has spaces 
numbered from 1 to 120 corresponding to the draw- 
ings. In the appropriate spaces, indicate for each 
drawing in the pair the presence (+) or absence (0) 
of eye emphasis according to the criteria listed be- 
low. Also list the number(s) corresponding to the 
criteria which determined your decision in each case. 


The judges were not aware of the order of 
presentation nor of how many subjects repre- 
sented each diagnostic group. Each judge 
worked independently of the others. 

When the judging was completed, the judg- 
ing sheets were scored. Each indication of 
judged emphasis per judge was scored as one 
point. Since there were two drawings from 
each subject and a judgment of eye and ear 
emphasis on each by the four judges, the 
maximum number of points possible for a 
given subject was sixteen. 
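The point tally can be sketched as below. The representation of a judgment as a (judge, drawing, feature) triple is an assumption made for illustration, and the example judgments are hypothetical.

```python
from itertools import product

JUDGES = range(4)            # four judges
DRAWINGS = (1, 2)            # two drawings per subject
FEATURES = ("eye", "ear")    # two judged features per drawing

# Four judges x two drawings x two features = 16 possible points.
MAX_POINTS = len(JUDGES) * len(DRAWINGS) * len(FEATURES)


def emphasis_points(judged: set) -> int:
    """Count the judgments of emphasis recorded for one subject.

    judged holds the (judge, drawing, feature) triples marked '+'.
    """
    return sum((j, d, f) in judged
               for j, d, f in product(JUDGES, DRAWINGS, FEATURES))


# Hypothetical subject: three positive judgments in all.
example_points = emphasis_points({(0, 1, "eye"), (1, 1, "eye"), (3, 2, "ear")})
```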

Reliability of judgments. The reliability 
among the judges was determined by two 
methods; percentage of agreement among the 
judges (see Table 2), and Hoyt’s (2) analy- 
sis of variance technique for determining reli- 
ability (see Table 1). 
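Hoyt's coefficient is usually stated as one minus the ratio of the residual mean square to the between-subjects mean square in a subjects-by-judges analysis of variance. The sketch below follows that common statement; the report does not show its computation, so this is an assumption, and the ratings are hypothetical.

```python
from typing import Sequence, Tuple


def hoyt_reliability(ratings: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Return (reliability, residual df) for a subjects x judges table.

    ratings[i][j] is the rating given to subject i by judge j.
    Reliability = 1 - MS(residual) / MS(subjects).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_judges = n * sum((m - grand) ** 2 for m in col_means)
    ss_residual = ss_total - ss_subjects - ss_judges
    if ss_subjects == 0:
        raise ValueError("subjects do not differ; coefficient undefined")
    df_residual = (n - 1) * (k - 1)
    ms_subjects = ss_subjects / (n - 1)
    ms_residual = ss_residual / df_residual
    return 1.0 - ms_residual / ms_subjects, df_residual


# Hypothetical case of perfect agreement among four judges on 120 subjects.
perfect = [[i % 2] * 4 for i in range(120)]
r_perfect, df_perfect = hoyt_reliability(perfect)
```

With 120 subjects and four judges the residual degrees of freedom are 119 x 3 = 357, the value legible in Table 1.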

Finally the sums (total number of judged
emphases) were tallied and a 2 X 2 chi-
square table set up to determine whether
there were significant differences between the
groups. The data were dichotomized (elimi-
nating 22 subjects in the mid-range; N = 98),
using 0-4 points as indicative of no emphasis
and 8-16 points as indicative of emphasis.


Table 2
Frequency Table of Percentage of Agreement Among the Judges

Percentage of
agreement        Frequency
(%)              (f)            (%)(f)
100              19             1900
 87              25             2175
 83              13             1079
 75              20             1500
 71               6              426
 67               6              402
 63               8              504
 58               7              406
 54               5              270
 50               4              200
 46               5              230
 42               2               84
Total           120             9176

Note. Mean percentage of agreement is 76.46.
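The dichotomy and the mean in the note to Table 2 can be checked with a short sketch. The frequency pairs are read from Table 2; the function name and the labels it returns are illustrative assumptions.

```python
from typing import Optional

# (percentage of agreement, frequency) pairs as read from Table 2.
AGREEMENT = [(100, 19), (87, 25), (83, 13), (75, 20), (71, 6), (67, 6),
             (63, 8), (58, 7), (54, 5), (50, 4), (46, 5), (42, 2)]

n_subjects = sum(f for _, f in AGREEMENT)          # 120
weighted_sum = sum(p * f for p, f in AGREEMENT)    # 9176
mean_agreement = weighted_sum / n_subjects         # frequency-weighted mean


def dichotomize(points: int) -> Optional[str]:
    """Assign a subject's point total to one side of the dichotomy.

    Totals of 5 to 7 fall in the mid-range and are dropped (None).
    """
    if points <= 4:
        return "no emphasis"
    if points >= 8:
        return "emphasis"
    return None
```

The quotient 9176/120 is 76.47 when rounded to two places; the printed 76.46 appears to be the same value truncated.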


Results and Conclusions 


An analysis of the data by means of chi 
square (see Table 3) indicates that there were 
no statistically significant differences between 
the diagnostic groups on the variables of eye 
and/or ear emphasis. 

Table 1 indicates that there is a rather high 
degree of agreement among the judges and 
adequate reliability between the judges. With 
the reliability established, and with the sig- 
nificance level of the chi-square analysis, it 
appears that the hypothesis that paranoid 
schizophrenics will produce greater eye and/or 
ear emphasis on human figure drawings is not 
supported by the data. 


Table 3 


Chi-Square Analysis: Judgments of Eye 
and Ear Emphasis 








                 No emphasis    Emphasis 
Group            (0-4 pts.)     (8-16 pts.)    Sums 
Paranoids            21              7           28 
Nonparanoids         56             14           70 
Sums                 77             21           98 





Chi square = .296; df = 2; p = .90-.80. 
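
The chi-square value under Table 3 can be recomputed from the four cell counts. The sketch below assumes an ordinary Pearson chi square without continuity correction, which the report does not state explicitly. Probabilities are shown for both one and two degrees of freedom: a 2 × 2 table conventionally has one degree of freedom, and the tabled df = 2 is reproduced only to show that it corresponds to the reported p range.

```python
import math

# Observed counts from Table 3: rows = groups, columns = (no emphasis, emphasis).
observed = [[21, 7],    # paranoids
            [56, 14]]   # nonparanoids

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi square without continuity correction (an assumption).
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (obs - expected) ** 2 / expected

# Upper-tail probabilities of the chi-square distribution (closed forms).
p_df1 = math.erfc(math.sqrt(chi_square / 2))   # one degree of freedom
p_df2 = math.exp(-chi_square / 2)              # two degrees of freedom

print(round(chi_square, 3), round(p_df1, 2), round(p_df2, 2))
```

On either reading the statistic is about .30 and falls far short of significance, so the conclusion drawn in the text is unaffected.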














Diagnostic Prediction from Figure Drawings 


Summary 


One hundred and twenty pairs of Draw-A- 
Person tests were selected in a Veterans’ 
Administration neuropsychiatric hospital to 
determine whether judged eye and/or ear 
emphasis could differentiate paranoid schizo- 
phrenics from unclassified schizophrenics, anx- 
iety neurotics, and “normals.” Scoring vari- 
ables were selected and the protocols were 
submitted to four judges independently of 
each other. The reliability among the judges 
was found to be adequate, as was their agree- 
ment. The variables of emphasis, however, 
proved to be statistically not significant, with 
reference to diagnosis, suggesting that further 
work is necessary before these variables may 
be used with any degree of confidence by the 
practicing clinician. 


Received July 30, 1956. 
Reliability and Validity of the n-Achievement Test 
Air Force Personnel and Training Research Center¹ 


and William W. Farquhar 
The n Achievement (6) is a projective test 
designed to measure an individual’s need to 
achieve, his ambition, his drive, his motiva- 
tion, or in other words his desire to succeed 
in competition with some standard of excel- 
lence. In response to picture stimuli the sub- 
ject writes stories which are scored for mo- 
tivational content according to a well-defined 
set of instructions (6, pp. 107-138). 

Much evidence on concurrent and construct 
validity and on interscorer reliability has been 
published (2, 6, 7, 8, 11). However, less evi- 
dence in the way of predictive validity (9, 
10) and test-retest reliability has been re- 
ported. 


Procedure 


As part of an investigation on the effect of 
different teaching methods on outcomes in a 
How To Study course (3, 5), the n Achieve- 
ment was administered to students at the be- 
ginning and again at the end of this one quar- 
ter course at the University of Minnesota. 
Pictures A, B, D, and E (6, p. 375) of the 
n Achievement were administered both times 
in that order. One psychologist scored all 
stories. His scoring of 80 stories by 20 indi- 
viduals (6, pp. 346-374) correlated .81 with 
the scoring of these same stories by the au- 
thors of the scoring system. Furthermore, on 
another sample of 80 stories from the present 
sample scored and rescored after an interval 


1 This research was accomplished while both au- 
thors were at the University of Minnesota. The opin- 
ions and conclusions expressed herein are those of 
the authors. They are not to be construed as neces- 
sarily reflecting the views or indorsement of the Air 
Force or of the Air Research and Development Com- 
mand. 


of a month, a score-rescore reliability of .91 
was obtained by this same psychologist. 

Three other tests which may be considered 
as tentative criterion instruments were also 
administered as pre- and posttests: the Sur- 
vey of Study Habits and Attitudes (SSHA), 
the Opinion Attitude and Interest Survey 
(OAIS), and a 99-item objective achievement 
examination (HTS) based on the content of 
the How To Study course. Both the SSHA 
and the OAIS may be considered as possible 
measures of achievement motivation. The 
SSHA consists of items about attitudes to- 
ward study and study habits and discrimi- 
nates between students with high and low 
grade-point averages (1). The OAIS is a con- 
figurally scored personality inventory which 
has been shown to add unique variance to the 
prediction of honor-point ratio at the Univer- 
sity of Minnesota (4). Both tests identify 
personality type variables differentiating over- 
achievers from underachievers (and presum- 
ably the highly motivated from the less highly 
motivated). The HTS achievement examina- 
tion provides a basis for evaluating actual 
achievement, although its internal consistency 
reliability is not as high as desired (r = .68 
by Hoyt’s analysis of variance technique). To 
determine how independent n Achievement is 
from scholastic aptitude, scores on the Ameri- 
can Council on Education Psychological Ex- 
amination (ACE), obtained from records of 
the Student Counseling Bureau, were corre- 
lated with it. 

The design of the experiment called for stu- 
dents to be randomly assigned to the different 
teaching methods. However, because of sched- 
uling difficulties it was not possible to assign 


Table 1 


The Correlations of n Achievement With Itself and Other Variables 





              Random          Nonrandom       Total           Total           Total Males 
              Males           Males           Males           Females         and Females 
              N = 80          N = 50          N = 130         N = 39          N = 169 
              n Ach   n Ach   n Ach   n Ach   n Ach   n Ach   n Ach   n Ach   n Ach   n Ach 
              (Pre)   (Post)  (Pre)   (Post)  (Pre)   (Post)  (Pre)   (Post)  (Pre)   (Post)  Mean    SD 
n Ach (Pre)                                                                                   5.2     4.6 
n Ach (Post)  .49**           .02             [?]             .25             .26**           6.0     4.4 
HTS (Pre)     .15     .10     -.03    .02     .09     .07     .22     -.07    .13     [?]     [?]     7.3 
HTS (Post)    .22*    -.03    .10     .03     .18*    .00     [?]     -.32    .12     -.02    71.3    7.7 
SSHA (Pre)    -.06    .02     -.07    .18     -.07    .08     .12     -.28    -.03    -.01    29.6    10.8 
SSHA (Post)   -.05    .13     .02     [?]     -.02    .20*    .17     -.09    .02     [?]     [?]     [?] 
OAIS (Pre)    .26*    .18     -.16    -.07    .08     .04     .17     -.01    .10     .03     41.4    10.7 
OAIS (Post)   .20     .17     -.13    .05     .05     .12     .09     -.05    .06     .08     42.4    11.5 
ACE           .17     .12     -.21    .05     .02     .09     .12     .02     .05     .07     101.0   19.7 





* Significantly different from zero at the .05 level. 
** Significantly different from zero at the .01 level. 


all students randomly. Those who could not 
adjust their schedules to the random assign- 
ment were designated as nonrandom students. 


Results 


The concurrent and predictive validities of 
the n Achievement are reported in Table 1 
along with test-retest reliabilities. 

It will be observed that the test-retest reli- 
ability with a nine-week interval is only .26 
for the total group. This correlation, although 
significant at the .01 level, is only minimal. 
If n Achievement measures the motivation it 
purports to measure, then that level of mo- 
tivation must be a rather unstable quantity. 
It may be that an individual’s “true” level of 
motivation actually does fluctuate widely. In 
the light of what is known about other trait 
measures and on a priori grounds this seems 
unlikely, but evidence on this point is pres- 
ently unobtainable. Nevertheless, if n Achieve- 
ment is to have any value for longitudinal 
prediction, it must show more consistency 
than it has so far. 
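
The contrast drawn above, between a correlation that is statistically significant and one that is practically useful, can be illustrated with the usual t test for a correlation coefficient. The sketch below is illustrative only; the function name is hypothetical and the critical value quoted afterward is approximate.

```python
import math

def t_for_correlation(r: float, n: int) -> float:
    """t statistic for testing a Pearson r against zero, with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Test-retest r and N for the total group, as reported in Table 1.
r_retest, n_students = 0.26, 169

t_stat = t_for_correlation(r_retest, n_students)

# Proportion of variance common to the two testings.
shared_variance = r_retest ** 2

print(round(t_stat, 2), round(shared_variance, 3))
```

With 167 degrees of freedom a t of about 3.5 exceeds the two-tailed .01 critical value (roughly 2.6), yet the two administrations share less than 7 per cent of their variance.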

Some slight evidence of validity could be 
indicated by the correlation of n Achievement 
with the HTS posttest. For the total male 
group the n Achievement pretest correlated 
.18 with the HTS posttest. However, the n- 
Achievement posttest, administered at ap- 
proximately the same time as the HTS post- 
test, correlated .00 with it. 





A few of the correlations with OAIS and 
SSHA are significant at the five per cent level. 
However, it must be remembered that when 
a large number of correlations are computed, 
a certain number may be “significant” by 
chance alone. Since the correlations are small 
and not consistent among groups, random 
variation seems the most logical explanation 
of these relationships. Similarly, the correla- 
tions with ACE show no consistent trends. 
For the total group they are nonsignificant 
and low. It would appear that for this sam- 
ple n Achievement is independent of scho- 
lastic aptitude. 
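
The caution about chance significance can be given a rough numerical form. The sketch below takes the 70 correlations of n Achievement with the seven other rows of Table 1 as the family of tests. The independence calculation is an idealization, since the groups overlap and the tests are therefore correlated.

```python
n_criteria = 7      # HTS, SSHA, and OAIS (pre and post each) plus ACE
n_columns = 10      # five groups x (n Ach pre, n Ach post)
alpha = 0.05

n_tests = n_criteria * n_columns

# Number of "significant" correlations expected when every true r is zero.
expected_by_chance = n_tests * alpha

# Chance of at least one "significant" r if the tests were independent
# (an idealization; see the lead-in).
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(n_tests, expected_by_chance, round(p_at_least_one, 3))
```

Three or four starred entries among 70 is thus about what chance alone would yield at the five per cent level.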


Summary 


The test-retest reliability of n Achievement 
(a measure of achievement motivation) and 
its relationship to certain other selected vari- 
ables has been computed on 169 students in 
How To Study classes at the University of 
Minnesota. A test-retest reliability of .26 
after a nine-week interval casts doubt on the 
stability as well as the possible validity of the 
measure. The correlations of n Achievement 
with an achievement examination in the course 
and with two other personality measures which 
correlate with academic success show no con- 
sistently positive or negative relationships. It 
is also independent of scholastic aptitude as 
measured by the ACE. 


Received August 16, 1956. 
Level of Aspiration in a Group of 
Peptic Ulcer Patients¹ 


Irving Raifman 
United States Naval Hospital, National Naval Medical Center, Bethesda, Md. 


Peptic ulcer is now commonly, even jok- 
ingly, regarded as an affliction that is pe- 
culiar to successful, self-driving businessmen. 
For years the professional literature on the 
subject has described those suffering from 
peptic ulcers as striving, overactive, ambitious, 
and efficient people. In 1934 Alexander and 
others (1, 2) postulated that one could dis- 
cover in these patients a “typical conflict” 
consisting of “intense receptive and acquisi- 
tive wishes against which the patient fights 
internally because they are connected with a 
sense of inferiority” (2, p. 127). A number of 
subsequent studies have provided support for 
this theory. 

This is a report of a study comparing the 
goal-setting and aspirational behavior of some 
peptic ulcer patients with that of a group of 
normal, and that of a group of psychoneurotic 
subjects (Ss). The Rotter Level of Aspiration 
Test (3, 4, 5) was selected for use, not only 
because it has been shown to be reliable, but 
also because the task involved is both enter- 
taining and ostensibly a matter of motor skill. 
Rotter’s instructions were modified by omit- 
ting the penalties for failures, in order to en- 


1 This paper is based on a doctoral dissertation 
submitted to the department of psychology, New 
York University in 1951. The study was discussed 
at the meeting of the APA in New York, Septem- 
ber, 1954. Grateful acknowledgment is due Professor 
Brian Tomlinson and other members of the super- 
visory committee, Dr. Robert Morrow, Chief Psy- 
chologist at the Bronx VA Hospital, New York, 
N. Y., and Miss Elizabeth Broomhead, Chief Psy- 
chologist at the United States Naval Hospital, Na- 
tional Naval Medical Center, Bethesda, Md. for 
their help, guidance, and cooperation. Opinions and 
conclusions are those of the author and do not re- 
flect the view or endorsement of the VA or the Navy 
Department. 


courage as much expression as possible of sub- 
jective, implicit ambition. 


Method 
Subjects and Procedure 


The experimental group was composed of 
15 patients with peptic ulcers who were hos- 
pitalized at the Bronx VA Hospital, New 
York. All were white male Ss with the ability 
to read and write English. Their ages ranged 
from 22 to 45, with a mean age of 31.80 years. 
The mean educational level was 10.53 years. 

One control group consisted of 15 men se- 
lected from the neuropsychiatric wards of the 
hospital and all diagnosed as “psychoneu- 
rotic.” The mean age of this group was 30.73 
and the educational achievement level was 
11.47 years. 

The other control group was selected from 
the general, medical, and surgical wards of 
the hospital. Their illnesses were not serious 
and not of a psychosomatic nature. This group 
was considered normal after each S had satis- 
fied the criteria established in the Cornell 
Selectee Index (6). The mean age was 30.87 
and the educational achievement level was 
11.77 years. 

There was no characteristic difference among 
the groups with respect to age or education 
nor was there any peculiar clustering of their 
occupational levels. Twenty of the 45 Ss were 
in the clerical sales group, and 6 in the 
professional, semiprofessional and managerial 
groups. Of these 6, only one was an ulcer 
patient. 

Each man made a total of 55 shots on the 
pinball-like Rotter Board. The first five shots 
were for practice. The remaining shots were 
divided into ten trials (five shots per trial), 
and before each trial the S was told the score 
he had just made and asked to predict what 
score he would be able to achieve on the next 
series. Using the figures thus obtained—the 
actual scores earned and the “bets” made— 
four final scores were computed in every case. 

The D score resulted from averaging the 
difference between the performance just made 
and the estimate that followed it. When the 
S’s bet was higher than the score he had just 
earned, his D score was positive; when he 
predicted that he would not do as well the 
next time, his D score was negative. This score 
reflects the discrepancy between aim and ac- 
complishment, with a high, positive D score 
indicating that the S was relatively tenacious 
in his optimism. 

The A score is the average of the difference 
between the S’s bet and the following per- 
formance. If his prediction was lower than 
the score he then achieved, his A score was 
positive; if his estimate was too high, his A 
score was negative. This score, then, reflects 
the discrepancy between the level of the goal 
set and the actual attainment. A negative A 
score implies failure to reach the desired goal. 

The I score is the average of the difference 
between the S’s score made during the practice 
period and the score he estimated he 
would be able to make the next time. It is assumed 
that the more positive the I score, the 
greater is the tendency to set unrealistic goals 
when starting something new. 

The C score represents the change between 
the S’s first five and his last five D scores. A 
positive C score indicates that he kept improving 
and also expected more of himself, 
while a negative C score reflects his willing- 
ness to lower his level of aspiration when 
faced with frustration and failure. 
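
The four scores defined above can be sketched in code as follows. The function and argument names are hypothetical. The text does not say exactly how the "change" in the C score was computed, so it is taken here as the mean of the last five D values minus the mean of the first five; that choice is an assumption.

```python
from statistics import mean

def aspiration_scores(practice_score, bets, scores):
    """Return the D, A, I, and C scores for one subject.

    practice_score: score earned on the five practice shots
    bets:   the ten estimates, each made just before a trial
    scores: the ten scores actually earned on those trials
    """
    if len(bets) != 10 or len(scores) != 10:
        raise ValueError("expected ten bets and ten trial scores")

    # Performance the S had just been told before making each bet.
    previous = [practice_score] + list(scores[:-1])

    # D: bet minus the performance just made (positive = bet is higher).
    d_values = [bet - prev for bet, prev in zip(bets, previous)]
    # A: following performance minus the bet (negative = fell short).
    a_values = [score - bet for score, bet in zip(scores, bets)]

    return {
        "D": mean(d_values),
        "A": mean(a_values),
        "I": bets[0] - practice_score,
        # Assumed form of the "change": last five D values vs. first five.
        "C": mean(d_values[5:]) - mean(d_values[:5]),
    }

# Hypothetical record for a single subject.
example = aspiration_scores(
    practice_score=20,
    bets=[25, 24, 26, 25, 27, 28, 26, 29, 27, 30],
    scores=[22, 24, 23, 25, 26, 24, 27, 25, 28, 26],
)
print(example)
```

In this invented record the subject bets above his last performance on every trial (positive D) and usually falls short of the bet (negative A), the pattern the report attributes to the ulcer group.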


Results and Discussion 


The figures in Table 1 point to the ulcer 
group as more striving than the normal or the 
psychoneurotic group. The C scores are not 
significantly different, but the other three in- 
dicate that there were significant differences 
between the patients with ulcers and the other 
patients. One might well conclude that per- 
sons suffering from peptic ulcers have diffi- 
culty curbing their aspirations, that they are 
less able than others to attain their goals be- 
cause they lack an appreciation of the prob- 
lems and limits involved, and that they are 
willing to gamble for high stakes without be- 
ing familiar with the task they are about to 
undertake. At any rate, this study, which is a 
part of a wider investigation of Alexander’s 
theory of the “typical conflict situation” of 
peptic ulcer patients, adds to the evidence 
that such persons do indeed tend to set them- 
selves impossibly high goals, and that they 
are different from normal and psychoneurotic 
groups on this score. 

All the scores of the normal Ss are con- 
sistent with what one would expect of them. 
In comparison with the ulcer patients, their 
needs are more realistic and it follows that 
they should attain more. They approach new 
tasks with relative caution, and as they be- 
come familiar with them, they move forward 
with the hope of doing better. 


Table 1 


Comparison of the Mean Scores on the Level of Aspiration Test for Ulcer Patients and Control Groups 























           Ulcer (U)        Normal (N)       Neurotic (PN)           t values 
Score      Mean     SD      Mean     SD      Mean     SD       U-N      U-PN     N-PN 
D score    2.83    1.24     1.61    1.40     1.67    1.53      [?]      [?]      [?] 
A score   -2.78    1.45    -1.39    1.55    -1.27    1.24     2.40*    2.96**    .23 
I score    5.87    5.17      .87    4.30     2.53    4.78     2.78**   1.78      .97 
C score   -7.40   11.88     3.27   15.64     -.67   13.05     2.03     1.36      .72 





* Significant at .05 level. 
** Significant at .01 level. 
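
The report does not state the formula used for the t values in Table 1. One conventional form for two independent groups of equal size n, assuming each SD was computed with n in its denominator, is t = (M1 - M2) / sqrt((SD1² + SD2²) / (n - 1)). The sketch below applies that assumed formula to the I-score row, where it gives values close to the tabled ones; the function name is hypothetical.

```python
import math

def t_equal_n(mean_1, sd_1, mean_2, sd_2, n):
    """t for two independent groups of equal size n.

    Assumes each SD was computed with n (not n - 1) in the denominator;
    this is a guess about the original computation, not a stated fact.
    """
    standard_error = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / (n - 1))
    return (mean_1 - mean_2) / standard_error

n_per_group = 15

# I-score means and SDs from Table 1.
ulcer = (5.87, 5.17)
normal = (0.87, 4.30)
neurotic = (2.53, 4.78)

t_u_n = t_equal_n(*ulcer, *normal, n_per_group)
t_u_pn = t_equal_n(*ulcer, *neurotic, n_per_group)
t_n_pn = t_equal_n(*normal, *neurotic, n_per_group)

print(round(t_u_n, 2), round(t_u_pn, 2), round(abs(t_n_pn), 2))
```

The same function may be applied to the other rows; small departures from the tabled values are to be expected from rounding of the published means and SDs.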
Summary 


The “typical conflict situation” in peptic 
ulcer patients which stimulates them to as- 
pire beyond their level of achievement sug- 
gests that ulcer patients differ from other pa- 
tients in goal-setting behavior. 

Fifteen veteran peptic ulcer patients were 
compared with a like number of normals and 
fifteen psychoneurotic patients on four meas- 
ures of their performance on the Rotter Level 
of Aspiration Board. The ulcer patients were 
significantly higher in their aspirations and 
lower in their attainment than either of the 
two control groups, and more inclined than 
the normal subjects to overestimate their 
ability at the beginning of the problem. All 
of these differences appear to indicate that 
ulcer patients are an ambitious lot who can- 
not achieve their aspirations because they set 
goals which to others seem insurmountable. 
The results support the impressions of 
Alexander and others with regard to the aspirational 
drives of peptic ulcer patients. 


Received September 19, 1956. 
Identification with Photographs of People¹ 


Jay L. Chambers 


Eastern State Hospital, Williamsburg, Virginia 


Several attempts have been made to deter- 
mine attitudes and personality factors through 
identification with photographs of people (1, 
2, 4, 5, 6). In the Szondi test, for example, 
patterns of likes and dislikes for photographs 
of different types of mental patients are as- 
sumed to reflect a subject’s own personality 
traits (3). 

The rationale of the identification hypothe- 
sis seems to be expressed by the saying “we 
like those who have the same traits as our- 
selves,” or, from an opposing point of view, 
“opposites attract.” The purpose of the pres- 
ent research was to test these alternative 
identification hypotheses. 

The possibility that a person may identify 
positively with photographs of people of the 
same sex but negatively with those of the op- 
posite sex was considered. The research re- 
ported here was limited to a study of identifi- 
cation with pictures of those of the same sex 
as the subjects (Ss). 

The trait of self-assertiveness was chosen to 
test the identification hypotheses. The term 
“self-assertive” appeared to be understand- 
able to the college students who were to serve 
as Ss and it could be rather objectively re- 
lated to overt behavior. Self-assertiveness was 
considered a trait of major importance in that 
it was defined similarly to traits such as domi- 
nance, ascendancy, aggressiveness, etc., which 
generally appear in factor analyses of person- 
ality measurement. 


Method 
Subjects 


A class of 33 junior and senior undergradu- 
ates in a course in Psychology of Personality 
at Muskingum College served as Ss. There 


1 This research was done at Muskingum College. 


were 15 men and 18 women in the class. It 
was felt that in the small liberal arts college 
where the research was conducted the students 
were well acquainted with each other and had 
a sufficient background in psychology to make 
fairly valid ratings and self-assessments. The 
Ss were naive with regard to the specific na- 
ture of the research and the connection be- 
tween the various tests. All of the tests were 
presented as part of a battery of tests and 
evaluations included in the course. The stu- 
dents were homogeneous with respect to age, 
intelligence, and socioeconomic background 
due to the selection involved in admission to 
the college and the attrition resulting from 
prerequisites for the course. 


Tests 


Picture identification test. Previous experi- 
menters have usually classified photographs 
on some a priori basis (e.g., diagnosis of a pa- 
tient) and then assumed the photograph had 
an equal stimulus value for all Ss. In order to 
correct for the possible weakness of this as- 
sumption, it was decided to permit each S to 
make his own selections of photographs he 
judged represented people high and low on 
the trait of self-assertiveness. This procedure 
allowed for individual differences and, at the 
same time, provided a direct determination of 
the stimulus value of a photograph for the 
individual. 

Pictures of college students, unfamiliar to 
the Ss, were selected from a college annual 
for the test. The pictures were of the same 
size and photographic quality and all of the 
students had dressed uniformly for the pic- 
tures. Girls wore dark sweaters and white 
beads and boys wore white shirts and dark 
ties. All pictures included the shoulders and 
face. The pictures were spaced in a square 
pattern in groups of four on white 6-in. by 
8-in. cards. There were 15 groups of boys’ 
pictures for the male series and 15 groups of 
girls’ pictures for the female series. Each card 
in the series was designated alphabetically and 
the pictures were numbered for identification 
purposes. Men received the male series and 
women received the female series. The Ss were 
given a form with written instructions to se- 
lect from each group of four pictures the per- 
son they judged as the most self-assertive and 
the person they judged to be the least self- 
assertive. Self-assertiveness was defined as a 
“tendency to have definite opinions on sub- 
jects and to state them; a tendency to domi- 
nate or influence associates; a tendency for 
the person to have his (or her) own way and 
to not back down in a conflict.” 

When the judgments were completed, each 
S received a second form with written instruc- 
tions to select from each group of four pic- 
tures the person they felt they would like the 
best and the person they felt they would like 
least “as a friend.” In scoring the test a + 
was given for each instance where there was a 
direct relation between the self-assertive judg- 
ments and the affective reactions (a “most 
self-assertive” selection paired with a “liked 
most” selection or a “least self-assertive” with 
a “liked least” selection). A — was scored 
when there was an inverse relation between 
the selections (a “most self-assertive” paired 
with a “liked least” selection or vice versa). 
The score for the test was the sum of the 
minuses subtracted from the sum of the 
pluses. The possible score range was from 
— 30 to + 30 points. 
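
The scoring rule just described can be sketched as follows. The data layout (one dictionary per card) and the names used are hypothetical conveniences, not part of the original test materials.

```python
def picture_identification_score(cards):
    """Score = number of direct pairings minus number of inverse pairings.

    Each card is a dict holding the four picture numbers chosen by the S:
    'most_assertive', 'least_assertive', 'liked_most', 'liked_least'.
    """
    pluses = minuses = 0
    for card in cards:
        # Direct relation: judgment and liking point the same way.
        pluses += card["liked_most"] == card["most_assertive"]
        pluses += card["liked_least"] == card["least_assertive"]
        # Inverse relation: judgment and liking point opposite ways.
        minuses += card["liked_least"] == card["most_assertive"]
        minuses += card["liked_most"] == card["least_assertive"]
    return pluses - minuses

# Three hypothetical response patterns, each repeated over all 15 cards.
direct = dict(most_assertive=1, least_assertive=2, liked_most=1, liked_least=2)
inverse = dict(most_assertive=1, least_assertive=2, liked_most=2, liked_least=1)
unrelated = dict(most_assertive=1, least_assertive=2, liked_most=3, liked_least=4)

all_direct = picture_identification_score([direct] * 15)
all_inverse = picture_identification_score([inverse] * 15)
no_relation = picture_identification_score([unrelated] * 15)

print(all_direct, all_inverse, no_relation)
```

Each card can contribute at most two pluses or two minuses, which is why 15 cards give the stated range of -30 to +30.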


Rating scale. A six-point rating scale for 
the trait of self-assertiveness was devised em- 
ploying the Q-sort technique. Each S was in- 
structed to rate every other S, placing them 
in a quasi-normal distribution provided with 
the scale. The same definition of self-assertive- 
ness provided for the picture identification 
test was supplied with the rating scale. Each 
S received a self-assertive rating score which 
was the average of all the ratings he received 
on the six-point scale. 


Guilford-Martin GAMIN test. The GAMIN 
test was selected as providing an “ascendancy” 
scale (A scale) and a “lack of inferiority” 
scale (I scale). It was felt that both of 
these scales might be related to the trait of 
self-assertiveness as previously defined. The 
test was administered and scored according to 
the standardized procedures. Raw scores for 
the A and I scales were used for purposes of 
analysis. 


Results 


The results were withheld from the Ss until 
all of the tests had been completed. Upon 
questioning at this time, only four Ss seemed 
to have recognized the possibility of correlat- 
ing the two parts of the picture identification 
test. The majority of the students expressed 
the feeling that there would only be a chance 
association between their judgment choices 
and their affective reactions. Thus, it could 
be reasonably assumed that any identification 
which took place was primarily at a noncon- 
scious level. This was not true for the GAMIN 
where the majority of the students felt they 
could recognize the significance of many items 
and, of course, the purpose of the rating scale 
was self-evident. 

As there were no significant differences be- 
tween the sexes on any of the measures the 
Ss were pooled into one group for correlation 
purposes. Pearson r was used except for the 
correlation between the rating scale and the 
picture identification test where a rank-order 
correlation was employed due to the restric- 
tion on variation of the rating scale scores. 

All of the correlations are in a direction to 
support a positive identification hypothesis. 
The Ss who preferred photographs of people 
of their own sex whom they had previously 
judged to be high on self-assertiveness tended 
to have high scores on measures of ascend- 
ancy, lack of inferiority, and self-assertive- 
ness. The Ss who least preferred photographs 
of people they judged to be high on self- 
assertiveness, tended to have relatively low 
scores on measures of ascendancy, lack of in- 
feriority, and self-assertiveness. None of the 
correlations between the picture identification 
test and the trait measures were high. The 
correlation with lack of inferiority was + .52, 
significant at the .01 level, the correlation with 
ascendancy was + .42, significant at the .05 
level, and the rank-order correlation with the 
rating scale measure of self-assertiveness was 
+ .32, significant at only the .10 level. 
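
With only 33 Ss, correlations of this size are estimated rather loosely. One way to show this, not used in the original report, is an approximate 95 per cent confidence interval obtained by Fisher's z transformation. The sketch below applies it to the two Pearson coefficients only; the function name is hypothetical.

```python
import math

def fisher_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r (Fisher z)."""
    z = math.atanh(r)                 # transform r to z
    half_width = z_crit / math.sqrt(n - 3)
    # Transform the interval limits back to the r scale.
    return math.tanh(z - half_width), math.tanh(z + half_width)

n_subjects = 33

# Correlations of the picture identification test with the GAMIN scales.
lack_of_inferiority_ci = fisher_interval(0.52, n_subjects)
ascendancy_ci = fisher_interval(0.42, n_subjects)

print([round(v, 2) for v in lack_of_inferiority_ci])
print([round(v, 2) for v in ascendancy_ci])
```

Both intervals lie above zero but are wide, which is in keeping with the statement that none of the correlations were high.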

Further analysis yielded a Pearson r of 
+ .48 between the ascendancy and lack of 
inferiority scales of the GAMIN. This was 
not surprising in view of the fact that both 
scales correlated significantly with the picture 
identification test. The results suggest 
that, for the Ss used in this research, the A 
and I scales of the GAMIN are not independent. 


Discussion 


Support for the positive identification hy- 
pothesis, obtained from this study, suggests 
other applications of a picture identifica- 
tion technique to personality measurement. 
Through the picture identification technique 
it may be possible to obtain quantitative 
measurement of many personality dimensions 
by correlating a S’s ratings of photographs 
along a given personality dimension with his 
affective reactions to the same photographs. 
A picture identification technique applied to 
personality measurement offers the advan- 
tages of objective, quantitative scoring, little 
reliance on verbal stimuli, and a measurement 
largely based on unconscious projection. These 
advantages and the results of this study in- 
dicate that further research in this area may 
be profitable. 


Summary 


Eighteen women and fifteen men under- 
graduate college students were given photo- 
graphs of unfamiliar college students and 
were asked to select pictures they judged to 
represent people high and low on the trait of 
self-assertiveness. Male Ss judged men’s pic- 
tures and female Ss judged women’s pictures. 
The Ss were then asked to indicate which of 
the same pictures they liked best and least. 
A measure of the strength and direction of the 
relationship between the Ss’ judgments and 
affective reactions was obtained from this pic- 
ture identification test. 




The Ss also rated each other by a Q-sort 
technique for the trait of self-assertiveness, 
and scores on the Guilford-Martin Ascend- 
ancy and Lack of Inferiority scales of the 
GAMIN test were obtained. Correlations be- 
tween the picture identification test and the 
trait measures of lack of inferiority, ascend- 
ancy, and self-assertiveness were significant at 
the .01 level, the .05 level, and the .10 level, 
respectively. None of the correlations were 
high, but all were in a direction to support 
the hypothesis of positive identification with 
photographs of people of the same sex. The 
Ss who tended to have a positive relationship 
between their preferences for photographs and 
their ratings of the same photographs along 
a self-assertive personality dimension, tended 
to score high on measures of ascendancy, lack 
of inferiority, and self-assertiveness. A nega- 
tive relationship between the picture ratings 
and preferences was obtained for Ss scoring 
low on the traits of ascendancy, lack of in- 
feriority, and self-assertiveness. 

The results indicate that a picture identifi- 
cation technique may be profitably applied to 
personality measurement when the S is per- 
mitted to determine the stimulus value of a 
photograph. 
The Goodenough Draw-A-Man Test as a Measure of 
Intelligence in Aged Adults¹ 


Allan W. Jones² and Thomas A. Rich³ 


Historically, the Draw-A-Man test was de- 
veloped for children, with whom it has proved 
to be a successful indicator of general intel- 
lectual level (4). Until now, its use with 
adults has been confined primarily to mental 
defectives (1, 2, 5). There is certain prelimi- 
nary evidence, however, that this technique 
could be used with older normals as well. Al- 
though not concerned specifically with intelli- 
gence as a variable, Lorge, Tuckman, and 
Dunn (9) have demonstrated decided quali- 
tative differences between the drawings of 
older and younger adults in such character- 
istics as proportion, motor coordination, and 
depth. Also, in a finding not detailed in their 
study, Kleemeier, Rich, and Justiss (7) noted 
a marked correspondence between quality of 
human figure drawings and general perform- 
ance level on a psychomotor evaluation of an 
elderly group. In addition, height of drawing 
appeared to be related positively to perform- 
ance level. 

From these observations, it appeared that 
human figure drawings of older adults might 
be useful as an indicator of their general 
intellectual functioning. In this study the 
Goodenough test was employed for drawing 
evaluation and the Wechsler-Bellevue for the 
measure of intelligence. 


1 This investigation was supported in part by a re- 
search grant M1057 from the National Institute of 
Mental Health of the National Institutes of Health, 
Public Health Service. The authors are indebted to 
Dr. Robert W. Kleemeier, Director, Moosehaven Re- 
search Laboratory for a critical reading of the manu- 
script. 

2 Now with the RAND Corporation, Santa Monica, 
California. 

3 At the University of Florida. 
Procedure 


The Ss in this study were 40 male resi- 
dents of a fraternal home for the aged (6). 
Their average age was 78.5 (SD = 6.2), 
ranging from 67 to 93 years. The modal edu- 
cation was 8 years with a range from 3-14 
years. All were volunteers, as well as partici- 
pants in an ongoing study of age changes in 
intelligence. 

Each S was tested individually in a single 
testing session of approximately 1½ hours’ 
duration. The Wechsler-Bellevue Intelligence 
Test, Form I, was administered, followed by 
the Goodenough Draw-A-Man Test. For the 
Goodenough, each S was provided with an 
8½-x-11-inch sheet of blank paper and a 
pencil. He was then instructed to draw a man 
and to make the drawing as complete as pos- 
sible. No time limit was set. 

All the drawings were scored independently 
by both the authors according to the pro- 
cedure described by Goodenough (3). The 
total score for each subject was the average 
of the two separate ratings. The height of 
each drawing to the nearest millimeter was 
also determined once by each author. In the 
few cases where a slight discrepancy existed, 
the average of the two measures was used. 
In addition, Wechsler Full Scale, Verbal, and 
Performance IQs were obtained by means of 
the standard age correction (14). 


Results 


Goodenough reliability. The Pearson product-moment correlation
between the two sets of independent ratings on the Goodenough
was .84. This compares favorably with other studies reporting
interjudge correlations ranging from .80 to .96 (10, 13). Previous
studies (3, 10) with younger groups have obtained test-retest
reliabilities from .68 to .94. Regarding height of drawing,
Kleemeier, Rich, and Justiss (7) found test-retest reliability
with older males to be .80 over a two-week interval. Similarly,
Lehner (8) reports no statistical difference between drawing
heights in two testings.


Table 1

Descriptive Statistics on Wechsler and Goodenough Test Scores

                          Mean      SD      Range
Goodenough               21.86   10.29   2.0-45.0
Drawing Height           10.24    4.34   1.0-18.4
W-B Full IQ              99.80   11.65     70-126
Verbal IQ               100.48   12.07     72-126
Performance IQ          105.18    9.08     80-128
W-B full wtd. score      62.63   22.20      9-111
Verbal wtd. score        36.10   13.54       7-67
Perf. wtd. score         26.53   10.29       2-53

Descriptive statistics. Table 1 presents the 
means, SDs, and ranges of Goodenough total 
score, drawing height, Wechsler IQs and 
Wechsler weighted scores for the 40 cases. 
Drawing height is reported in centimeters. 

Compared to Goodenough’s normative data 
(4), the average drawing score of 21.86 for 
the elderly group represents approximately 
the 8-year level. However, the SD of 10.29 is 
almost twice that found with the children 
(5.4), a finding due in part to the compara- 
tively greater age spread in the older sub- 
jects. 

Fig. 1. Averaged Goodenough Test score plotted against W-B Full
Scale IQs.


Relation of Goodenough to Wechsler. Table 2 shows the
intercorrelations between Goodenough test score, drawing height,
and the various W-B IQs and weighted scores. The r between
Goodenough and height of drawing is .64. Both of these appear to
predict about equally Wechsler Full Scale, Verbal, and Performance
IQs and full weighted score, with rs ranging from the highest of
.65 between Goodenough and W-B Full Scale to the lowest of .47
between drawing height and W-B Performance IQ. All these rs are
significantly different from zero beyond the .01 level.
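
The significance statement above can be illustrated with a minimal Python sketch of the usual t test for a product-moment r with N = 40. The function name `t_for_r` is hypothetical, and the critical value of 2.712 (two-tailed .01 level, 38 df) is an assumption taken from a standard t table; the source does not say which test or table was used.

```python
import math


def t_for_r(r: float, n: int) -> float:
    """t statistic for testing a correlation against zero (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Assumed two-tailed .01 critical value of t for 38 df (from a t table).
T_CRIT_01_DF38 = 2.712

N = 40  # sample size reported in the Procedure section
for r in (0.47, 0.65):  # lowest and highest rs quoted in the text
    t = t_for_r(r, N)
    print(f"r = {r:.2f}: t = {t:.2f}, beyond .01 level: {t > T_CRIT_01_DF38}")
```

Under these assumptions even the lowest r of .47 gives a t above the tabled value, which agrees with the statement in the text.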

Figure 1 shows the 40 averaged Good- 
enough scores plotted against their respective 
Wechsler IQs. There are but four cases which 
deviate markedly from the rest of the sam- 
ple. All of them have had either special train- 
ing or an interest in art. 

Table 2

Goodenough and Wechsler-Bellevue Intercorrelation Matrix

                           A        B       C     D     E     F     G
A. Age
B. W-B Full IQ           -.32*
C. W-B Verb. IQ          -.14     .95**
D. W-B Perf. IQ          -.17     .86**
E. W-B wtd. score        -.28     .98**
F. Goodenough            -.49**   .65**
G. Drawing Height        -.31

* p < .05 level.
** p < .01 level.


The rs between the Goodenough and the 11 W-B subtests form the
first column in Table 3. Next are listed these same rs with age
partialled out. Since both the Goodenough and the W-B subtests
correlate with age, partial correlation coefficients were computed
holding age constant. It was felt that the relationship between
the Goodenough and the W-B subtests would be more meaningful with
the age factor controlled. Considering these partial correlations,
Similarities and Digit Span do not correlate significantly (.05
level) with the Goodenough. Block Design and Picture Arrangement
reach the .05 level, and all other subtests are at the .01 level
or beyond. However, holding age constant does not alter
essentially the general relationship between the various subtests
and the Goodenough. From inspection of the partial correlations,
both verbal and performance tasks appear to be about equally
involved in drawing production. The highest relationships are
found with Picture Completion, Object Assembly, Comprehension,
Vocabulary and Information.


Table 3

Correlations Between Wechsler-Bellevue Subtests and Goodenough
Score, and Between W-B Subtests and Goodenough with Age Partialled

                                        Goodenough:
Wechsler subtests        Goodenough     Age constant
Picture Completion          .65             .54
Digit Symbol                .60             .45
Comprehension               .58             .50
Information                 .54             .48
Object Assembly             .53             .52
Picture Arrangement         .53             .38
Vocabulary                  .52             .49
Arithmetic                  .50             .42
Block Design                .49             .35
Similarities                .41             .30
Digit Span                  .34             .31

r of .32 = .05 level; r of .41 = .01 level.
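
Holding age constant, as described above, is usually done with the first-order partial correlation formula. The Python sketch below shows that formula. The function name `partial_r` is hypothetical. The values .65 (Goodenough with Picture Completion) and -.49 (Goodenough with age) come from the tables, but the subtest-age correlation of -.40 is an invented illustration, since the source does not report subtest correlations with age.

```python
import math


def partial_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


r_good_subtest = 0.65   # Goodenough with Picture Completion (Table 3)
r_good_age = -0.49      # Goodenough with age (Table 2)
r_subtest_age = -0.40   # HYPOTHETICAL: not reported in the source

print(round(partial_r(r_good_subtest, r_good_age, r_subtest_age), 2))
```

Because the third input is assumed, the printed value is not expected to reproduce the tabled partial correlation; the sketch only shows how removing a shared age trend lowers the zero-order r.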

When age was also partialled out of the correlations between
Wechsler Full Scale, Verbal, and Performance IQs, and height of
drawing, no significant changes were noted. However, there was a
small but progressive drop in height of drawing with age, with a
mean figure height of 11.65 cms. in the 66-75 age group, 9.72 cms.
for those between 76 and 85, and, finally, 7.96 cms. for subjects
age 86 and above. Other studies (10, 13) have reported similar
decreases in height of figure drawing with age.

Qualitative differences. The percentage of
the older males receiving credit for each of 
the 51 Goodenough items was computed. 
These percentages were compared to those of 
the normal 8-year-old children in the original 
Goodenough data (3), a group comparable 
only in terms of average Draw-A-Man score. 
Seven items differentiated young and old by 
at least 32 percentage points. The greatest of
these differences showed that 98% of the chil- 
dren included clothing in their drawings, as 
opposed to only 31% of the oldsters. Too, 
children were more inclined to draw both 
arms and legs in two dimensions (86% vs. 
51%), and to score a point for motor co- 
ordination (77% vs. 45%). The older males 
drew ears more frequently (64% vs. 32%), 
indicated shoulders (50% vs. 12%), had the 
length of trunk greater than the breadth (81% 
vs. 49%), and drew the outline of the neck 
continuous with the head, trunk, or both 
(65% vs. 32%). 

The intrusion of variables other than age 
in comparing these children and older adults 
makes detailed interpretation unwarranted. It 
does appear, however, that there are many 
decided qualitative differences between young 
and old human figure drawings. Undoubtedly, 
a scale designed specifically for an older group 
could detail the differences much more ade- 
quately. 


Discussion 


Although the Draw-A-Man test is not of- 
fered as a substitute for a more complete in- 
telligence testing in the aged, it does appear 
that a quick estimate of intellectual level can 
be obtained by its use. Certainly, it would be 
most appropriate for those cases not amenable 
to a more formal psychometric evaluation. 
Through use of a scale modified along the 
lines Lorge suggests (9), it might be possible 
to increase the over-all intelligence prediction 
accuracy. Still, it is quite interesting to note 
that height of drawing predicts IQ, as well as 
the detailed Goodenough scoring procedure. 
To be sure, this finding needs cross valida- 
tion; but because of the generally poor qual- 
ity of drawing in the aged (average 8-year 
level) and decline in psychomotor control 
evident, it may be that drawing height reflects
not so much age as total available motor
output. This would undoubtedly be more
likely true for the somewhat deteriorated case, 
where a certain minimal energy level to draw 
at all is needed. 

A final point to be considered is that the 
Goodenough test is undoubtedly influenced 
by both previous training and experience in 
art. For Ss with such a background the draw- 
ing test constitutes a different kind of task 
and probably reflects training rather than in- 
tellectual level. The scatter plot shown in 
Fig. 1 lends support to this, with the four 
atypical cases being Ss either at present ac- 
tive in art or expressing strong interest in it 
in the past. 


Summary 


The Wechsler-Bellevue Intelligence Test, 
Form I, and the Goodenough Draw-A-Man 
Test were administered to 40 aged (M = 78.5)
male residents in a fraternal home for
the aged to test the relationship between hu- 
man figure drawings of older adults and their 
general intellectual level. 

The two drawing variables obtained were 
Goodenough test score and height of figure 
in centimeters. Both measures correlated about 
equally with W-B Full Scale, Verbal, and Per- 
formance IQs, and total weighted score, with 
correlations ranging from .47 to .65 (p < .01 
level). Also, partialling out age did not alter 
the essential relationship between Goodenough 
and the W-B subtests. 

The average drawing score of 21.86 was 
comparable to that obtained by the typical 
8-year-old, although marked qualitative dif- 
ferences were found. It was suggested that the 
Goodenough could be used as a quick estimate
of IQ in aged adults, but that a scoring
procedure designed specifically for such people
might raise the over-all predictability.


On WAIS Difference Scores


Quinn McNemar 
Stanford University 


Although differences between subtest scores on the Wechsler
scales are of supposed diagnostic significance, the recent WAIS
Manual (2) does not include information pertaining to norms for,
or reliabilities of, any of the 55 possible difference scores
among the 11 tests.

On the basis of the reliabilities of the 11 tests and the
intercorrelations among the tests, as given in the manual, it is
possible to extract data needed for evaluating difference scores.
This note will present such data for ages 25-34, for which
N = 300.

The mean of the algebraic differences will, of course, be zero.
The standard deviation of any distribution of differences is
readily obtained by utilizing the well-known formula for the
variance of the difference between correlated scores. The 55
needed (normative) standard deviations, $\sigma_D$ values, are
given as the Arabic numbers above the diagonal in Table 1. It will
be noted that these $\sigma_D$’s vary from 1.8 to 3.6 (in scaled
score points).


Table 1

Reliabilities (Below Diagonal) and SDs (Arabic Above Diagonal) for Difference Scores, and
Standard Errors of Measurement of the Difference Scores (Italics; second line of each row)

Subtest                     1    2    3    4    5    6    7    8    9   10   11

 1. Information                  [SDs for columns 2-11 illegible]
                                1.7  1.6  1.5  2.0  1.1  1.2  1.5  1.5  2.1  1.9
 2. Comprehension         .47        3.0  2.6  3.3  2.2  3.2  2.8  3.0  2.8  3.2
                                     1.9  1.8  2.3  1.6  1.7  1.8  1.9  2.4  2.2
 3. Arithmetic            .59  .59        2.8  3.0  2.7  3.2  3.0  3.0  3.0  3.4
                                          1.8  2.2  1.5  1.6  1.7  1.8  2.3  2.1
 4. Similarities          .60  .50  .62        3.1  2.2  2.9  2.8  2.9  2.9  3.3
                                               2.1  1.3  1.4  1.6  1.7  2.2  2.1
 5. Digit Span            .54  .53  .48  .55        3.0  3.3  3.3  3.3  3.1  3.6
                                                    1.9  1.9  2.1  2.1  2.6  2.4
 6. Vocabulary            .63  .48  .71  .62  .60        2.7  2.6  2.9  2.6  3.2
                                                         1.1  1.3  1.4  2.0  1.8
 7. Digit Symbol          .80  .72  .76  .76  .64  .84        3.1  3.1  3.0  3.2
                                                              1.4  1.5  2.1  1.9
 8. Picture Completion    .64  .57  .66  .66  .60  .74  .78        2.6  2.8  2.9
                                                                   1.7  2.2  2.1
 9. Block Design          .?9  .61  .63  .67  .58  .77  .76  .58        2.7  2.6
                                                                        2.3  2.1
10. Picture Arrangement   .36  .27  .42  .43  .30  .41  .51  .36  .32        2.9
                                                                             2.5
11. Object Assembly       .63  .52  .60  .61  .53  .68  .64  .49  .37  .25

The reliability coefficients, calculated by a formula given by
Kelley (1, p. 415), for the 55 sets of difference scores are given
below the diagonal. These coefficients vary from .84 down to .25,
with a median of .60. Only 10 of the 55 reliabilities exceed the
not so respectable value of .70.

The 55 standard errors of measurement for the difference scores
are set in italics above the diagonal in Table 1. These were
calculated by the usual formula for $\sigma_e$, i.e.,
$\sigma^2_{e(d)} = \sigma^2_D (1 - r_{dd})$, then checked by
another formula from Kelley (1, p. 415). Note that, as might have
been expected from the low reliabilities, these $\sigma_{e(d)}$’s
are fairly large relative to the respective SDs for distribution
of difference scores. In other words, a sizable portion of a given
difference score variance is attributable to errors of
measurement.
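
The three quantities discussed above can be sketched in Python as follows. The function names are hypothetical. The reliability expression is the common equal-variance textbook form and is only an assumption here, since Kelley's formula is cited in the text but not reproduced. The subtest values of 3, .80, and .50 are invented for illustration; only the last line uses figures from Table 1 (Arithmetic versus Comprehension).

```python
import math


def sd_of_difference(sd1: float, sd2: float, r12: float) -> float:
    """SD of the difference between two correlated scores."""
    return math.sqrt(sd1**2 + sd2**2 - 2 * r12 * sd1 * sd2)


def reliability_of_difference(r11: float, r22: float, r12: float) -> float:
    """Reliability of a difference score, assuming equal variances."""
    return (0.5 * (r11 + r22) - r12) / (1 - r12)


def sem_of_difference(sd_d: float, r_dd: float) -> float:
    """Standard error of measurement of a difference score."""
    return sd_d * math.sqrt(1 - r_dd)


# HYPOTHETICAL inputs: two subtests with SD 3, reliability .80, and r = .50.
sd_d = sd_of_difference(3.0, 3.0, 0.50)
r_dd = reliability_of_difference(0.80, 0.80, 0.50)
print(round(sd_d, 1), round(r_dd, 2), round(sem_of_difference(sd_d, r_dd), 1))

# Table 1 entries for Arithmetic versus Comprehension: SD 3.0, reliability .59.
print(round(sem_of_difference(3.0, 0.59), 1))
```

The last line recovers the tabled standard error of 1.9 from the tabled SD and reliability, which is consistent with the formula quoted in the text.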

The practical meaning of the foregoing 
should be obvious. When we consider that 
the search for “significant” differences from 
among 55 possible differences tends always to 
capitalize somewhat on chance, it would not 
seem unreasonable to insist that any obtained 
difference be about 2.5 times the appropriate
$\sigma_{e(d)}$ before accepting the difference as
nonchance. But even when a difference is so
judged as indicative of a real disparity in 
“ability,” it is still necessary to raise the 


question as to its possible diagnostic signifi- 
cance. It is here that the SD of the distribu- 
tion of difference scores for the normative 
group needs to be considered. Take, for ex- 
ample, a difference of 5 for Arithmetic versus 
Comprehension. Such a difference is “signifi- 
cant” at the .01 level so far as error of meas- 
urement is concerned, but the fact that the 
normative distribution of differences between 
these two tests has an SD of 3.0 indicates that 
10 per cent of normals will have differences 
as large as 5 points. 
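
The arithmetic behind this example can be checked with a short Python sketch. It assumes that difference scores are normally distributed and that "as large as" is read two-tailed; neither assumption is stated in the source, and the helper name `two_tailed_normal_p` is hypothetical. The values 1.9 and 3.0 are the Table 1 entries for Arithmetic versus Comprehension.

```python
import math


def two_tailed_normal_p(z: float) -> float:
    """Proportion of a normal distribution at least |z| SDs from the mean."""
    return math.erfc(abs(z) / math.sqrt(2))


difference = 5.0      # obtained Arithmetic-Comprehension difference
sem_of_diff = 1.9     # standard error of measurement of the difference
normative_sd = 3.0    # SD of differences in the normative group

p_error = two_tailed_normal_p(difference / sem_of_diff)
p_normals = two_tailed_normal_p(difference / normative_sd)

print(f"chance of so large a difference from error alone: {p_error:.3f}")
print(f"proportion of normals with so large a difference: {p_normals:.3f}")
```

Under these assumptions the first proportion falls below .01 and the second is close to .10, in line with the figures quoted in the text.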

The above example was chosen to be typi- 
cal; the range for this sort of thing is from 
4 per cent to as many as 30 per cent of nor- 
mals yielding “significant” differences (as 
judged by error of measurement). Could it 
be that so many “normals” among the age 
25-34 standardization group are producing 
abnormal differences? 

When one considers the reliabilities and in- 
tercorrelations of the 11 tests given in the 
WAIS Manual for other age levels, there is 
no reason for believing that the reliabilities 
for difference scores at other age levels will 
deviate much from the figures given in 
Table 1 for the 25-34 age bracket.


The Effects of Scale and Practice on WAIS and W-B I Test Scores¹


Samuel Karson, Kenneth B. Pool, and Sheldon L. Freud 
School of Aviation Medicine, USAF, Randolph Air Force Base, Texas 


The newly standardized Wechsler Adult In- 
telligence Scale was developed to compensate 
for several recognized defects in the older 
Wechsler-Bellevue Intelligence Scale, Form I.²
Although a number of the original W-B I 
items have been retained in the WAIS, the 
new scale is claimed to provide a more ade- 
quate range of item difficulty, more reliable 
subtests, and to be based on a more repre- 
sentative reference sample (6). 

The introduction of the new scale has raised 
several problems for psychologists. Problems 
of equivalence are involved in cases where 
scores from both scales may have to be con- 
sidered together, as is frequently necessary in 
clinical, research, and statistical operations. 
Problems of transfer from one scale to the 
other will have to be considered when pa- 
tients are retested, for any reason, with a 
second scale. Since there is no alternate form 
of WAIS, W-B I may frequently be used in 
this way. 

These problems have already been experienced by the writers in
their Air Force situation, and the present study was undertaken
to assess the equivalence of WAIS and W-B I scales and to evaluate
the transfer effects from one to the other on a sample of Air
Force flying personnel referred to the School of Aviation
Medicine, USAF, for medical and psychological evaluation.


1 The authors would like to express their appreciation to Dr.
Samuel C. Fulkerson for his aid and advice in the analysis of
variance design used in this study, to Lt. John Sussenberger for
his assistance in checking the accuracy of all of the original
test scores, and to Drs. S. B. Sells and A. L. Kubala for their
comments and suggestions.

2 Hereinafter WAIS is used as an abbreviation of the Wechsler
Adult Intelligence Scale and W-B I as an abbreviation of the
Wechsler-Bellevue Intelligence Scale, Form I.

The specific questions investigated were (a) 
the equivalence of comparable scores on the 
two scales (this was studied both by correla- 
tion and by comparison of scores of the same 
subjects on both scales with order of administration
rotated experimentally); and (b) practice
effects from WAIS to W-B I and also
from W-B I to WAIS. Since the performance 
scale tests are generally speeded, and for the 
most part involve manipulation of the test 
materials, it was expected that transfer would 
be greater for these subtests. 

The sample consisted of 52 Air Force flyers whose average age was
33, with a range of 20 to 39 years. Their mean number of years of
education was 14, with a range of 12 to 16 years. These officers
were referred to the Department of Clinical Psychology as a part
of their consultation at the School of Aviation Medicine. They
were referred for evaluation between July, 1955, and May, 1956,
and were tested with the WAIS and W-B I as part of their
diagnostic psychological evaluation. Each individual was
administered the WAIS and W-B I by the same psychologist, the
scale given first being alternated so that one-half of the sample
had W-B I first, while the other half had WAIS first. Almost all
of the testing was accomplished by one experienced examiner,
although two other examiners also participated in the test
administration to a limited extent. The sample employed in this
research is believed to be typical of the flying personnel who
are referred for consultation in the United States Air Force. The
mean and standard deviation of the WAIS Full Scale IQ for our
sample are 121.98 and 6.50, respectively, as compared with a mean
of 110.23 and a standard deviation of 25.25 on the WAIS
standardization sample for the age range 20 to 34. It appears
that the present sample overlaps the WAIS standardization sample
principally in the superior range of intelligence.


Table 1

Analysis of Variance Results of the Effects of Scale and Practice and Product-Moment Correlations

                              Comparison of scalesᵃ                Effects of practiceᵃ
                      W-B I scoresᵇ   WAIS scoresᵇ              First   Second           Product-
                                                          F     admin.  admin.      F     moment
Variable                   M     SD       M     SD     ratio    meanᶜ   meanᶜ    ratio     rᵈ
Information            12.67   1.55   13.27   1.35    14.66ᵉ   12.79   13.15     5.50ᵉ    .68
Comprehension          13.77   1.65   15.06   2.08    21.96ᵉ   14.06   14.77     6.70ᵉ    .43
Digit Span             11.38   2.93   12.79   3.11    30.50ᵉ   11.75   12.42     7.02ᵉ    .82
Arithmetic             14.02   2.80   13.46   2.24     2.12    13.27   14.21     6.08ᵉ    .37
Similarities           13.17   1.96   13.00   1.71     0.72    12.77   13.40     9.56ᵉ    .75
Vocabulary             12.86   1.13   13.23   1.94     3.28    12.77   13.33     7.64ᵉ    .67
Picture Arrangement    13.04   3.22   12.90   2.98     0.10    12.12   13.82    14.94ᵉ    .45
Picture Completion     13.73   0.94   13.61   2.37     0.18    13.37   13.98     5.48ᵉ    .62
Block Design           14.08   1.89   13.50   2.12     5.82ᵉ   13.38   14.19    11.38ᵉ    .65
Object Assembly        14.00   1.88   14.06   2.62     0.02    13.23   14.83    19.34ᵉ    .31
Digit Symbol           12.42   2.20   11.75   2.75     6.16ᵉ   11.35   12.83    29.78ᵉ    .70
Verbal IQ             121.19   6.43  120.29   6.84     1.92   118.92  122.56    31.02ᵉ    .72
Performance IQ        125.65   8.61  121.44   8.85    10.36ᵉ  119.31  127.79    42.00ᵉ    .25
Full Scale IQ         125.38   6.43  121.98   6.50    16.40ᵉ  120.60  126.77    53.94ᵉ    .46

ᵃ The scale and practice analyses of variance were evaluated using as the error term the
variance of the residual subjects within cells, df 1 and 50.
ᵇ Mean scores for the total sample with order of administration alternated.
ᶜ Mean scores for W-B I and WAIS combined at each period of administration.
ᵈ An r of .33 is significant at the .05 level of confidence when N = 52.
ᵉ Denotes significance. An F ratio of 4.03 is significant at the .05 level of confidence and
7.17 at the .01 level in the table.

Bartlett’s test for homogeneity of variance (3) among the 14
corresponding measures (i.e., the 11 subtests³ of WAIS and W-B I
and their VIQ, PIQ, and FSIQ) yielded significant chi squares
only for Voc ($\chi^2$ = 18.36, p < .01) and PC ($\chi^2$ = 37.53,
p < .01). These results, which revealed significant heterogeneity
for Voc and PC, make the interpretation of the respective F
ratios less certain. For the other subtests the null hypothesis
could not be rejected. Thereafter, three separate scale-,
practice-, and scale-by-practice analyses of variance (3) were
completed for the WAIS and W-B I scores on each of the 11
subtests and for the VIQ, PIQ and FSIQ scores. In all, 42
analyses of variance were done. The equivalence of subtest scores
and IQ’s was further compared by means of Pearson coefficients of
correlation, since it was believed that the person-to-person
subtest score equivalence could be low even though the group
means might be similar. In order to avoid the contamination of
practice effects, the total sample was not pooled; instead,
correlations were computed between the 26 persons who took each
of the two scales first and then for the 26 persons who took each
of the two scales last. Then, by means of Fisher’s z
transformation (3), the two correlations were averaged.

3 The following abbreviations are used for the subtests: Inf =
Information; Comp = Comprehension; Arith = Arithmetic; Sim =
Similarities; Dig Sp = Digit Span; Voc = Vocabulary; DS = Digit
Symbol; PC = Picture Completion; BD = Block Design; PA = Picture
Arrangement; and OA = Object Assembly. Also, the following were
used: VIQ = Verbal IQ; PIQ = Performance IQ; and FSIQ = Full
Scale IQ.

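
The averaging of the two subsample correlations by Fisher's z transformation can be sketched in Python as below. The function name `average_r` is hypothetical, and an unweighted mean of the two z values is assumed because both subsamples have N = 26. The input pairs are the Arithmetic (.07, .61) and Performance IQ (-.12, .56) correlations reported for the two orders of administration.

```python
import math


def average_r(r1: float, r2: float) -> float:
    """Average two correlations from equal-sized samples via Fisher's z."""
    z_mean = (math.atanh(r1) + math.atanh(r2)) / 2.0
    return math.tanh(z_mean)


print(round(average_r(0.07, 0.61), 2))    # Arithmetic
print(round(average_r(-0.12, 0.56), 2))   # Performance IQ
```

Under this reading the two results round to .37 and .25, the values listed for Arithmetic and Performance IQ in the product-moment column of Table 1.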
Results 


Table 1 presents a comparison of the means of the two scales,
with order of administration controlled, and the effects of
practice when either scale was administered first. Three of the
verbal subtests (Inf, Comp, and Dig Sp) show significant mean
differences, the WAIS scores being higher. Two of the performance
subtests (BD and DS) also show significant mean differences in
favor of the W-B I scores. Two of the W-B I IQ scores (PIQ and
FSIQ) are significantly higher than the corresponding WAIS
scores. These results appear to be, at least in part, a function
of the different weighting systems used on WAIS and W-B I in the
transformation of the raw scores to weighted scores.
Specifically, a total weighted score of 65 on the W-B I verbal
subtests, which is equivalent to a weighted score of 78 on WAIS,
yields a VIQ three points higher on W-B I than the VIQ on WAIS
for the age range 20 to 24, and four to five points higher for
the age range 25 to 34. It is of interest to observe in this
connection that the age range represented by the present sample
is very similar to that of the sample selected as the reference
group upon which the scaled scores for the WAIS are based (6).
The transformations to IQ units of the scaled scores earned on
the verbal subtests appear to have resulted in greater
comparability of the VIQ’s on the two scales; this F ratio is not
significant, even though significant differences were observed
between these two scales on three of the verbal subtests. On the
performance subtests, however, the transformations to IQ units of
the scaled scores apparently did not operate to reduce the
significant differences observed between the subtests of BD and
DS, since the F ratio between the two PIQ’s is significant beyond
the .01 level of confidence.

Significant practice effects were observed in 
all of the 14 comparisons made without re- 
gard to scale. The verbal subtests were ap- 
parently less affected by practice than the 
performance subtests, as evidenced by an ex- 
amination of the significance levels of the F 
ratios reported in Table 1 for the effects of 
practice, as well as by inspection of the actual 
mean increments attributable to practice for 
the individual subtests. The actual magnitude 
of the differences between the pre- and post- 
practice means should not be taken as an 
index of the amount of transfer to be ex- 
pected on either scale, since these means rep- 
resent an average of two nonequivalent scales. 
These findings are in keeping with a priori expectations that the
verbal subtests would show relatively fewer practice effects than
the performance subtests.

Intercorrelations between W-B I and WAIS for the present study
are listed in the last column of Table 1. Although most of the
group means for W-B I and WAIS are quite comparable, it is
observed that the person-to-person correspondence for some of the
subtests was low, notably OA, Arith, Comp, and PA. The low
correlation between the two PIQ’s of only .25 is also notable.
These results suggest that one could not predict with much
accuracy one subtest score from the other, on these particular
subtests, for our sample.

The scale-by-practice interaction is not reported,
since the Inf subtest had the only significant
F ratio here (F = 6.60, p < .05).
This interaction was evaluated by using the 
variance of the residual subjects between cells 
as the error term, with degrees of freedom 
being 1 and 50. This is interpreted to indi- 
cate that, for the most part, practice effects 
do not appear to depend on which of the two 
forms is administered first. 

Table 2 presents the means and standard 
deviations for WAIS and W-B I when each 
of these tests is administered first, as well as 
the Pearson product-moment correlation co- 
efficients for the 26 patients who were ad- 
ministered the WAIS initially, and the 26 
who were administered W-B I first. Here 
again, although the correlations between the 
two forms are quite low in some instances 
(notably an r of only .07 for Arith, .26 for 
OA, — .12 for PIQ, and .15 for FSIQ) when 
WAIS is given first, much similarity among 
the means is observed. Low correlations are 
also in evidence when W-B I is administered 
first, although again the means are quite simi- 
lar. The correlation between the two Comp 
subtests is only .33, that between the two OA 
subtests only .37, between the two subtests of 
PA .40, and between the two Voc subtests .45. 

In comparing the two sets of correlations in Table 2, it is
difficult to understand the fluctuations in the observed
correlations associated with scale, i.e., which of the two tests
is given initially. With W-B I administered first, in contrast to
when WAIS was taken initially, the correlation between the two
Arith subtests increased from .07 to .61; that between the two
PIQ’s from -.12 to .56; and that for the FSIQ’s from .15 to .69.
Moreover, when WAIS was given first, the three highest subtest
correlations (notably Dig Sp, Sim, and Voc) showed a marked
reduction in size when the order of administration was reversed.
No obvious explanation of these effects is apparent, although
they may be only a function of the small sample size used in this
study.


Table 2

Comparison of WAIS and W-B I Scales for Two Subsamples (N = 26)

                              WAIS admin. first                       W-B I admin. first
                      WAIS scores    W-B I scores   Product   W-B I scores    WAIS scores   Product
                                                    moment                                  moment
Variable                   M     SD       M     SD     rᵃ         M     SD       M     SD     rᵃ
Information            13.54   1.45   13.31   1.35    .58     12.04   1.48   13.00   1.18    .75
Comprehension          14.65   2.45   14.08   1.66    .50     13.46   1.58   15.46   1.53    .33
Digit Span             12.46   3.09   11.73   2.98    .90     11.04   2.84   13.11   3.09    .69
Arithmetic             13.35   2.02   14.85   2.27    .07     13.19   3.03   13.58   2.44    .61
Similarities           12.58   1.60   13.38   1.67    .86     12.96   2.19   13.42   1.71    .58
Vocabulary             12.69   1.81   12.88   1.37    .81     12.85   0.82   13.77   1.91    .45
Picture Arrangement    11.96   2.62   13.81   3.24    .51     12.27   3.02   13.85   3.01    .40
Picture Completion     12.92   2.35   13.65   0.78    .51     13.81   1.08   14.31   2.18    .70
Block Design           13.50   2.22   14.88   1.58    .51     13.27   1.83   13.50   2.01    .75
Object Assembly        13.31   2.78   14.85   1.79    .26     13.15   1.56   14.81   2.20    .37
Digit Symbol           10.92   2.74   13.08   2.39    .67     11.77   1.76   12.58   2.48    .73
Verbal IQ             118.73   6.11  123.27   5.38    .67    119.11   6.72  121.85   7.17    .76
Performance IQ        117.08   7.33  129.77   7.66   -.12    121.54   7.46  125.81   8.04    .56
Full Scale IQ         119.08   5.43  128.65   5.19    .15    122.11   5.89  124.92   6.16    .69

ᵃ For significance at the .05 level of confidence an r must be .388 when N = 26.


Discussion


These results are similar to those found by Barry et al. (1) who
compared W-B I and W-B II on a sample of military flyers. Those
authors reported significant practice effects on VIQ, PIQ, and
FSIQ, as well as on Dig Sp, Arith, BD, and DS. The present study
found significant practice effects on all three of the IQ’s, as
well as on all of the verbal and performance subtests. Barry et
al. also reported significant scale effects on Sim, PC, OA, and
DS, while none of these particular subtests, except for DS, was
found to be significant in this study. It may be that the DS
subtests on the Wechsler scales were more prone to the effects of
both practice and scale than any of the other subtests.

It is of interest also to compare the present findings with those
of Cole and Weleba (2) who studied 46 college students on WAIS
and W-B I. They reported significant practice effects for all
three IQ scores upon administration of the second test, and they
observed that the greatest practice effects were found on the
performance subtests. They obtained the following correlations
between IQ scores for the two tests: verbal .87, performance .12,
and full scale .52. Our corresponding correlations are verbal
.72, performance .25, and full scale .46. They also reported
correlations between VIQ and PIQ of .33 on WAIS, and of .14 on
W-B I. In our study with WAIS administered first, the Pearson r
between WAIS VIQ and PIQ was .19, while the corresponding W-B I
correlation was .14. With W-B I given first, the correlation
between VIQ and PIQ on WAIS was .09 and the corresponding W-B I
correlation was .17. With regard to effects of practice and the
correlations of the verbal and performance IQ scores between and
within the two scales, it is readily apparent that there is much
agreement between their results and ours.


Summary and Conclusions 


This study compared the WAIS and W-B I 
scales with respect to score equivalence and 
effects of practice. The results with regard to
score equivalence were as follows. An analysis
of variance revealed significant differences be- 
tween the two scales on Information, Compre- 
hension, Digit Span, Block Design and Digit 
Symbol, as well as on the Performance and 
Full Scale IQ’s. The three verbal subtests 
were significantly higher on WAIS than on 
the corresponding W-B I subtests, while the 
two performance subtests and two IQ meas- 
ures were significantly higher on W-B I. 

On the whole, subjects tended to retain the 
same relative rank on both tests, although this 
was not true for Arithmetic, Object Assembly, 
Performance IQ or Full Scale IQ when WAIS 
was administered first, or for Comprehension 
and Object Assembly when W-B I was given 
initially. The average correlations between the 
two tests are not regarded as high enough to 
warrant their being used interchangeably be- 
cause of the relatively low correlations be- 
tween several of the subtests, notably, Arith- 
metic, Object Assembly, Comprehension, and 
Picture Arrangement, as well as between the 
Performance IQ’s and the Full Scale IQ’s. 

Practice effects were also found regardless 
of which scale was administered first on all of 
the verbal and performance subtests, as well 
as on the Verbal IQ, Performance IQ, and Full 
Scale IQ. Scale-by-practice interaction re- 
vealed significant variation only for Informa- 
tion. Apparently, the Information subtest on 
WAIS yields significantly higher test scores 
than its corresponding subtest on W-B I 
independent of order of administration and 
practice effects. In view of the findings of 
this study it is concluded that W-B I is not 
a satisfactory alternate for the WAIS and 
that a need for an alternate form of the WAIS 
is indicated. 

Although the results reported here are at- 
tenuated in comparison with the WAIS stand- 
ardization sample, our findings reveal a num- 
ber of important differences between the two 
scales which might be critical in clinical and 
research work. As with all new tests, a con- 
siderable body of statistical information bear- 
ing on validity with practical criteria must be 
accomplished before full acceptance can be 
granted for the WAIS. In the meantime, the 
improved standardization of the WAIS is a 
strong argument in favor of its adoption by 
clinical psychologists. 


Received September 5, 1956. 
A Comparison of Two Methods of Estimating 
Full Scale IQ From an Abbreviated WAIS 


Philip Himelstein¹ 


Air Force Personnel and Training Research Center 


Recent studies have shown that there is 
a very dependable relationship between Full 
Scale WAIS IQ and the Doppelt regression 
equation (1) for both normal and psychiatric 
groups. In using the Doppelt equation, the 
scaled scores of the Arithmetic, Vocabulary, 
Block Design, and Picture Arrangement sub- 
tests are added, and this sum is multiplied by 
2.5. A constant based on the subject’s age is 
added to the product to determine the esti- 
mated Full-Scale Score. The purpose of this 
study is to compare the efficiency of this re- 
gression equation with the more usual clinical 
procedure of prorating the scores of the same 
four subtests. 

The subjects consisted of 61 hospitalized 
psychiatric patients who were tested on the 
admission service of a psychiatric hospital 
with a full WAIS. Of the 61 patients, 29 were 
diagnosed schizophrenic reaction and 12 were 
brain damaged. The WAIS was scored three 
times: by the usual calculation procedure 
described in the manual (2), by the Doppelt 
equation, and by prorating the Verbal and 
Performance scores separately and then summing 
them to estimate Full Scale Score. 
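
The two abbreviated scoring rules compared here can be sketched as follows. This is only an illustration under stated assumptions: the proration step assumes the six Verbal and five Performance subtests of the full WAIS, the age constant is a hypothetical placeholder (the Doppelt constants are not given in this report), and the function names are illustrative. Both functions return an estimated Full Scale Score; converting that score to an IQ through the manual's tables is not shown.

```python
def doppelt_estimate(arithmetic, vocabulary, block_design,
                     picture_arrangement, age_constant):
    """Estimated Full Scale Score from the regression rule in the text:
    sum the four scaled scores, multiply by 2.5, add an age constant.
    `age_constant` is a placeholder; real values come from Doppelt (1)."""
    total = arithmetic + vocabulary + block_design + picture_arrangement
    return 2.5 * total + age_constant


def prorated_estimate(arithmetic, vocabulary, block_design,
                      picture_arrangement, n_verbal=6, n_performance=5):
    """Estimated Full Scale Score by prorating each half separately.
    Assumes 6 Verbal and 5 Performance subtests in the full scale."""
    verbal = (arithmetic + vocabulary) * n_verbal / 2
    performance = (block_design + picture_arrangement) * n_performance / 2
    return verbal + performance


if __name__ == "__main__":
    # Hypothetical scaled scores for one subject, for illustration only.
    scores = dict(arithmetic=9, vocabulary=11, block_design=8,
                  picture_arrangement=10)
    print(doppelt_estimate(**scores, age_constant=0))  # constant is a placeholder
    print(prorated_estimate(**scores))
```

Proration weights the two Verbal subtests by 3 and the two Performance subtests by 2.5, whereas the regression rule weights all four by 2.5 and adds the age constant; the similar weights are consistent with the close agreement between the two methods reported below.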

The mean Full Scale IQ of the present 
group was 87.3. The means of both the regression 
and proration methods of scoring were an 
identical 88.5. The average deviation from 
Full Scale IQ obtained with the regression 
equation was 4.0, and for the prorated 
scores, it was 4.5. 


1 An extended report of this study may be obtained 
without charge from Philip Himelstein, Personnel 
Laboratory, Air Force Personnel and Training 
Research Center, Box 1557, Lackland AFB, Texas, 
or for a fee from the American Documentation 
Institute. Order Document No. 5102, remitting $1.25 
for microfilm or $1.25 for photocopies. 

When the Full Scale IQ (including the 
four subtests) was correlated with IQs ob- 
tained with the regression equation, a coeffi- 
cient of .954 was obtained. Full-Scale and 
prorated IQs correlated .953. The two meth- 
ods of estimating Full Scale IQ correlated 
.988 with each other. 

This study indicates that, for the four 
WAIS subtests investigated, both the regres- 
sion equation and the proration method of 
estimating Full-Scale WAIS scores are about 
equal in effectiveness. The clinician should 
feel free to use the method with which he 
is most comfortable. It would appear that 
Wechsler (2, p. 31) was overly cautious in 
warning against the use of less than five Ver- 
bal subtests and four Performance subtests in 
a prorating procedure. 


Brief Report. 
Received January 8, 1957. 
The Validity of the Barron Ego-Strength Scale 
and the Welsh Anxiety Index¹ 


Ronald Taft 


University of Western Australia 


This paper represents an attempt to confirm 
the validity of two MMPI scales that have 
been proposed for distinguishing normals from 
persons with pathological symptoms. The 
scales are the Barron Ego-strength scale (1) 
and the Welsh Anxiety Index (5), both of 
which have been shown by their authors to 
distinguish normal subjects from several 
groups of psychiatric patients (see also 3). 
The question being asked here is whether 
these indices generalize so widely that they 
hold up in a different culture to that of the 
U.S.A.; namely, in Australia. It has already 
been demonstrated (4) that, with a normal 
group, the only possibly important differences 
on the MMPI scales are higher Australian 
scores on Mf and Pd for males and Mf, D 
and Sc for females. 


Subjects 
Normals 


The group taken as representative of the 
normals consisted of 40 male and 10 female 
first-year psychology students. These subjects 
were administered the MMPI test together 
with the rest of their class, as part of their 
regular course work. In order to facilitate 
matching with the clinical sample, only stu- 
dents aged over 20 years were used and the 
50 subjects were chosen at random from this 
pool. Some results, however, will also be 
quoted from the other students. 


Patients 


The psychiatric patients were drawn from 
a pool of outpatients and inpatients at a 
repatriation (veterans) hospital. All patients 
attending the psychiatric ward of the hospital 
were given the MMPI by routine. The out- 
and inpatients were about equal in number. 
Approximately two-thirds of the patients were 
diagnosed as psychoneurotic and psychopathic 
personality and the remainder as psychosomatic 
or psychotic cases. The sample used 
in this present study was selected to match 
the student sample on sex, age, and educational 
level as far as possible. The matching 
data are reproduced in Table 1. 


1 The author thanks Miss Ronnie Jennings for 
assistance with the computational work for this report, 
and Mr. J. R. E. White for making available the 
psychiatric records. 


Table 1 


Characteristics of Australian Sample 


               Number               Age 
Group       Male   Female      Mean    Range     Minimum education 
Students     40      10         32     21-54     12 years 
Patients     40      10         35     20-54     10 years 


Results 


For comparative purposes I have averaged 
all the relevant means and standard devia- 
tions published in the articles by Barron (1) 
and Welsh (5). In the case of the Ego- 
strength scale (Table 2) the U. S. mean is 
taken as the mean of the six studies quoted 
by Barron that refer to psychiatric samples 
(363 cases). The standard deviation repre- 
sents the median of the SD in these studies. 
At least one of Barron’s samples included 
both men and women, and the sex distribution 
in the other clinical samples is unknown. 
A study by Quay (3) using student nurses as 
the normal group and female inpatients as 
the clinical group is also quoted, as our results 
suggest that females score lower than 
males on the Ego-strength scale. 


Table 2 


Comparative Scores on the Barron Ego-Strength Scale 


                                                     Male         Female         Mixed 
Sample                                              M     SD      M     SD      M     SD 
U. S. A. Normals 
  1. Barron: Graduate students (N = 40)            40.9   5.6 
  2. Barron: Air Force Officers (N = 60)           52.7   4.0 
  3. Quay: Student nurses (N = 92)                               45.1   5.8 
Australian Normals 
  4. Taft: 1st year Psychology (matched group) 
     (N = 50)                                                                  47.4   4.7 
  5. Taft: 1st year Psychology (unmatched group) 
     (N = 50 males, 20 females)                    48.8   4.0    42.2   5.0    46.9   4.3 
U. S. A. Psychiatric Patients 
  6. Barron (6 samples) (N = 363)                                              41.5   7.9 
  7. Quay (N = 74)                                               37.5   7.2 
Australian Patients 
  8. Taft: matched group 
     (N = 40 males, 10 females)                    38.6   9.7    33.2  10.0    37.5   9.8 


The comparative results on Ego-strength 
(Es) are reported in Table 2. 

A similar comparison is made in Table 3 
between U. S. and Australian results for the 
Anxiety Index (AI). The U. S. figures were 
derived by obtaining the mean of all of the 
relevant samples quoted in (2) and (5). 
Psychoneurotic cases described as “mild” have 
been omitted. 


Table 3 


Comparative Scores on the Welsh Anxiety Index 


Sample                                          N         M 
U. S. Normals 
  1. Goodstein: Male students*               > 5500     55.1 
  2. Black: Female students*                 > 5000     50.0 
  3. Welsh: Mean of 13 normal samples        < 1700     48.6  (SD of one sample, 16.0) 
Australian Normals 
  4. Taft: Matched sample                        50     50.6  (SD, 17.2) 
  5. Taft: Unmatched sample of students 
       Male                                      65     54.6 
       Female                                    67     51.9 
       <25 yrs                                   85     55.7 
       >25 yrs                                   47     48.6 
       Total                                    132     53.2 
U. S. Psychiatric Patients 
  6. Welsh: Mean of 24 samples                > 800     79    (SD, 27-31) 
Australian Patients 
  7. Taft: Matched sample                        50     78.6  (SD, 31.0) 

* Computed from figures quoted in (2) representing the combined means of varied student samples. 













The following summarizes the findings: 


1. The Es scale distinguishes the Australian 
normals (sample 4) from the matched sample 
of psychiatric patients (t is significant at 
the .01 level). Taking 43 as the breaking 
point score, 78 per cent of the normals 
exceeded this compared with 24 per cent of the 
patients. 

2. The AI distinguishes the Australian normals 
(sample 4) from the matched sample of 
psychiatric patients (t is significant at the 
.01 level). Taking 66 as the breaking point 
score, 18 per cent of the normals exceeded this 
compared with 62 per cent of the patients. 

3. The scores for Australian male freshmen 
students on the Es scale were 2.1 points be- 
low the U. S. graduate students, significant 
at the .05 level. 

4. The scores for Australian female fresh- 
men students on the Es scale were 2.9 points 
below the U. S. student nurses, significant at 
the .05 level. 

5. The scores for the Australian psychiatric 
patients on the Es scale were 4.0 points be- 
low the U. S. patients, significant at the .01 
level. 

6. The mean AI scores for the Australian 
male and female students (sample 5) did not 
differ from the corresponding U. S. subjects 
(samples 1 and 2). 

7. The mean AI for the Australian patients 
was virtually the same as that for the U. S. 
patients. For the three samples quoted (N = 
223), 76 per cent of Welsh’s patients exceeded 
a score of 60, compared with 64 per cent of 
the Australian patients. This difference is not 
significant. 

8. The Australian male students obtained 
higher Es scores than the female, and Bar- 
ron’s male graduate students obtained higher 
scores than Quay’s female nurses. Both of 
these differences are significant at the .01 
level. 

9. The Australian male students obtained 
higher AI scores than the female, but this 
difference is not significant. In the U. S. samples 
of students, the males also obtained substantially 
higher AI scores (55.1 versus 50.0). 

10. On the Australian student sample (sample 5), 
the younger subjects (under 25) obtained 
a higher AI than the older, significant 
at the .05 level. This is consistent with the 
finding of Wauck (5, pp. 69-70) that the AI 
of schizophrenic patients decreases with age. 
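
The significance statements in findings 1 and 2 can be checked roughly from the published means and standard deviations. The sketch below is not the author's computation: the report does not say which t formula was used, so the unequal-variance form is assumed (with equal group sizes it gives the same statistic as the pooled-variance form), and the function name is illustrative.

```python
import math


def t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t statistic for a difference between two independent group means,
    computed from summary statistics (unequal-variance form)."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


if __name__ == "__main__":
    # Es scale: matched Australian normals vs. patients (Table 2, mixed-sex columns).
    es_t = t_from_summary(47.4, 4.7, 50, 37.5, 9.8, 50)
    # Anxiety Index: matched Australian patients vs. normals (Table 3).
    ai_t = t_from_summary(78.6, 31.0, 50, 50.6, 17.2, 50)
    print(round(es_t, 2), round(ai_t, 2))
```

Both statistics come out well above the value of about 2.6 needed for the .01 level with groups of this size, which agrees with findings 1 and 2.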


Summary and Conclusions 


The validation of the Barron Ego-strength 
scale and the Welsh Anxiety Index was 
checked by comparing a sample of 50 Aus- 
tralian students with a matched sample of 50 
patients in a psychiatric clinic. The following 
represent the overall conclusions: 


1. The validity of the Es scale is confirmed 
in that it successfully distinguishes the nor- 
mal from the clinical subjects. Its validity 
thus generalizes beyond the American culture. 

2. The norms for Australian students and 
patients are consistently below the Ameri- 
cans. Since this applies to patients as well as 
both male and female students, this suggests 
that the significance of some of the items 
varies with the culture, rather than that the 
differences in the Es scores reflect real per- 
sonality differences between Australians and 
Americans. The latter is possible, however. 
(See the discussion on this point in 4.) 

3. Males score higher than females on Es. 

4. The validity of the AI is confirmed in 
that it successfully distinguishes the normal 
from the clinical sample. Its validity thus 
generalizes beyond the American culture. 

5. The norms for the Australian students 
and patients do not differ from the American 
norms for the AI. 

6. Male students obtain higher AI scores 
than female. 

7. The AI decreases with age. 
Subtle and Obvious Test Items and Response Set 
Recent research (6, 7, 9, 10, 11, 12) with 
the subtle and obvious scales for the Minnesota 
Multiphasic Personality Inventory 
(MMPI) raises some important interpretative 
problems. While it is not puzzling to find that 
hospitalized psychiatric patients and other 
maladjusted and unsuccessful groups obtain 
high T scores on the obvious scales, it is 
disconcerting to find that groups of recovered 
psychiatric patients, successful trainees, 
successful salesmen, and college sophomores 
obtain higher T scores on the subtle scales than 
the “normal” MMPI population, and higher 
T scores than groups of unrecovered psychiatric 
patients, unsuccessful trainees, etc. Since 
each of the items in the subtle scales was 
originally selected by Hathaway and McKinley 
because it discriminated between normal 
and abnormal groups, it appears that 
there is something common either to the items 
or to the groups which influences the size of 
T scores more than does the original 
discriminating power of the items. 

It is quite possible that groups obtaining 
high T scores on the subtle scales have a 
response set to answer “false” to the items in 
the MMPI and consequently are “caught” on 
a disproportionate number of items which are 
scored false (2). That this explanation is 
plausible can be seen from Table 1. It shows 
that Wiener’s division of the MMPI items 
(10, 12) for five clinical scales into subtle (S) 
and obvious (O) items tends to make “false” 
the scored response for the S items. Four of 
the five S scales have a large majority of their 
items scored false; only the S-Ma scale has 
more true than false responses. It is of interest 
that it is the S-Ma scale which does not 
separate the groups studied by Wiener (10, 
11). As well as the number and percentage of 
scored responses, Table 1 gives the standard 
score a person would have if all items are 
marked true or if all items are marked false. 
The magnitude of the differences in percentages 
and standard scores should be noted. 
The small number of items in each of the S 
and O scales and possible inaccuracies in 
classifying the items as S and O¹ should be 
considered in evaluating the response set 
explanation. 


Table 1 


Scored Responses for Five MMPI Scales Containing Subtle and Obvious Items 


                                                    MMPI scale 
                               D               Hy               Pd               Pa               Ma 
Scored response           T    S    O      T    S    O      T    S    O      T    S    O      T    S    O 
True                     20    3   17     13    1   12     24    4   20     25    5   20     35   15   20 
False                    40   17   23     47   27   20     26   18    8     15   12    3     11    8    3 
Total                    60   20   40     60   28   32     50   22   28     40   17   23     46   23   23 
Percentage true          33   15   43     22    4   38     48   18   71     63   29   87     76   65   87 
Percentage false         67   85   57     78   96   62     52   82   29     37   71   13     24   35   13 
Percentage diff.        -34  -70  -14    -56  -92  -24     -4  -64   42     26  -42   74     52   30   74 
T score when all 
  marked true            58   21   83     44   26   71     75   33   94    100   47  120     97   74  106 
T score when all 
  marked false          106   77  101    106   83   92     81   86   58     70   76   54     43   59   26 
T score difference      -48  -56  -18    -62  -57  -21     -6  -53   36     30  -29   66     54   15   80 


It has been previously suggested (2) that 
the K scale (having 29 of its 30 items scored 
false) was essentially a measure of the 
response set to answer false to personality test 
items. If K is a measure of this response set, 
then according to the true-false imbalance in 
the S and O scales in Table 1, it would be 
expected that the S scales, with the exception of 
S-Ma, would correlate positively with K, and 
the O scales would correlate negatively with 
K. The correlation coefficients (r) for a sample 
of 144 out-patient psychiatric patients are 
shown in Table 2. Nine of the ten expectations 
are fulfilled; only the S-Ma scale has 
misbehaved, but only by a small amount. It 
is probable that K or some other response-set 
scale³ could be used as a suppressor variable 
to improve the validity of the S and O scales 
(i.e., scores from a scale such as K would 
have to be subtracted from the scores of four 
of the S scales, and would have to be added 
to all of the O scales). 


Table 2 


Correlations Between K and Subtle and 
Obvious MMPI Scales 
(N = 144) 


                          MMPI scale 
Items          D       Hy      Pd      Pa      Ma 
Subtle        .67  [illegible] .38    .?6     .02 
Obvious      -.51    -.48    -.60    -.50    -.65 


1 Subtlety-obviousness for the five scales was 
determined rationally and not empirically, by Wiener; 
subtle and obvious scales were not formed for the 
other clinical scales because Wiener felt that they 
contained too few subtle items. 

2 The writer is grateful to Drs. Sherman E. Nelson 
and Daniel N. Wiener for supplying the scores from 
which the r’s were computed. Dr. Nelson indicated 
in a personal communication that the sample 
consisted of serial cases who came for treatment to the 
Veterans Administration’s Mental Hygiene Clinic in 
St. Paul, Minnesota. 

3 A scale which may be useful for this purpose is 
described in “A Response Bias (B) Scale for the 
MMPI.” J. counsel. Psychol., 1957, 4, in press. 


Table 3 


Correlations Between K and Uncorrected MMPI Clinical Scales 


                                               MMPI scale 
Sample                  Sex     N       D     Hy     Pd     Pa     Ma 
Normals 
  Graduates (4)          M     100      08     53     09     22    -42 
  College (8)            M     112     -07     50    -09     19    -30 
  College (1)            M     179     -20     21    -26    -12    -37 
  Normal (5)             M     100      15     48    -17    -07    -36 
  Normal (5)             F     100     -03     30    -06     02    -28 
  Median correlation coefficient       -03     48    -09    -02    -36 
Abnormals 
  NP patients (8)        M     110     -19     14    -38    -26    -45 
  NP patients (13)       M     100     -04     15    -16    -15    -10 
  NP patients            M     144     -27     10    -34    -27    -45 
  Abnormals (5)          M     100     -29     11    -26    -19    -37 
  Abnormals (5)          F     100     -16     17    -21    -13    -38 
  Hysterics (3)          F      63     -19     04    -30    -12    -42 
  Median correlation coefficient       -19     13    -28    -17    -40 

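
The percentage rows of Table 1 follow directly from the counts of true-keyed and false-keyed items, and recomputing them is a quick check on the keying imbalance that the response-set argument rests on. The sketch below is illustrative only: the counts are those read from Table 1 (a few digits that are garbled in the source were inferred from the row totals), percentages are rounded half up so that 42.5 gives 43, as printed, and the names are not the author's.

```python
# (true-keyed, false-keyed) item counts for the subtle (S) and obvious (O)
# parts of each scale, as read from Table 1.
COUNTS = {
    "D":  {"S": (3, 17),  "O": (17, 23)},
    "Hy": {"S": (1, 27),  "O": (12, 20)},
    "Pd": {"S": (4, 18),  "O": (20, 8)},
    "Pa": {"S": (5, 12),  "O": (20, 3)},
    "Ma": {"S": (15, 8),  "O": (20, 3)},
}


def keying_balance(n_true, n_false):
    """Return (percentage true, percentage false, difference) for one scale."""
    total = n_true + n_false
    pct_true = int(100 * n_true / total + 0.5)  # round half up
    pct_false = 100 - pct_true
    return pct_true, pct_false, pct_true - pct_false


if __name__ == "__main__":
    for scale, parts in COUNTS.items():
        for part, (n_true, n_false) in parts.items():
            print(scale, part, keying_balance(n_true, n_false))
```

A negative difference marks a scale keyed mostly false, on which a subject who tends to answer "false" would score high whatever the item content; by this count, four of the five S scales are of that kind.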

On the basis of some data and theoretical 
considerations, Wiener (10) suggested that 
the S items are best for assessing the person- 
ality of normal persons and that the O items 
are best for abnormal persons. If it is true 
that S items are more likely to function in a 
normal population and O items in an abnor- 
mal population, then according to the present 
contention that K operates as a measure of 
response set, a difference in r’s should be reflected 
in normal and abnormal groups. Specifically 
it would be expected that the r’s of 
K with the full scales for D, Hy, Pd, Pa, and 
Ma would be more positive, or less negative, 
in a normal than in an abnormal population. 
The r’s presented in Table 3 reveal that four 
of the five expectations are fulfilled; only 
the r’s between K and Ma are not substan- 
tially different. The data summarized here up- 
hold Wiener’s contention that different items 
function in normal and abnormal groups. A 
tentative hypothesis drawn from the data is 
that the more positive the correlation between 
scores from a measure of response set to an- 
swer false and scores from clinical scales (com- 
posed of subtle and obvious items), the more 
likely it is that the group is well adjusted or 
successful. More research, preferably with an 
instrument other than the MMPI, is needed 
to further test the hypothesis. 


Summary 


A response-set explanation was offered to 
account for the repeated finding that rela- 
tively well-adjusted and successful persons 
obtain more abnormal scores on the subtle 
scales of the MMPI than maladjusted and 
unsuccessful persons. Evidence was assem- 
bled through an analysis of the subtle and 
obvious scales which supports the previously 
formulated (2) response-set interpretation of 
the K scale. 
A most common feature of personality theory 
and projective testing is the assumption 
of some degree of stimulus equivalence be- 
tween father and other persons in authority 
roles. That such should be the case would be 
predicted from most theories of learning and 
perception. The truth of the assumption seems 
to be confirmed daily in clinical practice. Few 
topics could bring more unity of prediction 
from competing psychological theories. And 
yet an effort to test the hypothesis in a quan- 
titative fashion utilizing a series of measures 
of attitude toward father, boss, and fellow 
worker failed utterly (1). The present study 
represents another effort toward the same end. 

One person’s characterization of another 
contains both systematic variance correlated 
with the person being described and systematic 
variance correlated with the person doing 
the describing. The first we regard as accurate 
knowledge; the second is often characterized 
as projected content, reflecting the personality 
of the judge or the residues of prior experience 
in which the person being judged was not 
involved. In conditioning language, the response 
to the stimulus of another person represents 
in part habits learned through the reinforcement 
of prior responses to that specific 
person (the valid component), and in part 
response tendencies coming from learning 
situations involving other persons, transferred 
to the present person on the basis of stimulus 
generalization (the projected component). In 
the terms of one or another cognition theory, 
the percept or concept of the other person 
represents in part a veridical component 
(greater the more structured the situation or 
the more reality testing that has taken place) 
and in part meaning introduced through 
assimilation of the percept to preexisting 
schemata, sets, hypotheses, attitudes, archetypes, 
or the like. For either point of view, or for 
earlier psychoanalytic notions, the projected 
component would tend to make some of the 
descriptions which one person makes of several 
others more similar than the “real facts” 
of their personalities would justify. Such 
similarity in trait descriptions is employed in this 
study to infer generalized response tendencies 
or attitudes. 


1 This study was supported in part by the United 
States Air Force under contract No. AF 18(600)-170 
monitored by the Crew Research Laboratory, Air 
Force Personnel and Training Research Center, 
Randolph Air Force Base, Randolph Field, Texas. 
Permission is granted for reproduction, translation, 
publication, use and disposal in whole and in part by or 
for the United States Government. 


Let us consider a person’s description of 
his boss, of a subordinate whom he supervises, 
of his father, and of his younger 
brother. “Projected” or “generalized” variance 
might be expected to spuriously increase 
the similarity of the descriptions in terms of 
two pairs of attitudes: The similarity of 
description between father and boss should be 
enhanced by any generalized attitude toward 
authority figures, or any generalization to current 
authorities of responses learned to the 
father in childhood. And while the status of 
siblings is not as clearly hierarchical as is the 
employment situation, one might expect that, 
through similarity of role, the subordinate 
would elicit responses originally learned to the 
younger brother, generating some unwarranted 
similarity between the descriptions of the two. 
In orthogonal fashion, boss and subordinate 
both come from the current work-a-day world, 
and might be jointly influenced by general 
attitudes toward the work situation. Likewise, 
father and younger brother descriptions 
should both be affected by any generalized 
attitudes toward the childhood home and 
family. In any single set of descriptions the 
tendencies created by such attitudes would 
be overshadowed by actual differences in the 
personalities involved, and by other specific 
sources of variance, but in a series of comparisons, 
these slight tendencies should show 
some overall effect. 


Method 

As a part of a larger study attempting to 
measure generalized attitudes towards superiors 
and subordinates, men of various Air 
Force samples filled out a questionnaire 
entitled “Description of Self and Others.” The 
instrument provided a list of thirty personality 
traits. The respondent was asked to describe 
a number of persons by use of these 
traits, picking the 8 best fitting traits and 
the 8 terms most definitely not appropriate, 
leaving the 14 intermediate terms unmarked. 
Where appropriate to his situation, the 
respondent described, among others, his father, 
his next younger brother, his immediate 
supervisor (or “boss”) and a subordinate, a 
person whom he supervised. It was judged 
maximally efficient to limit the initial analysis to 
respondents who described all four of this 
symmetrical set of persons. From 91 preflight 
cadets tested at Lackland Air Force Base in 
January 1953, 15 such persons were obtained. 
From 70 officers of B-29 Bomber crews tested 
at Randolph Air Force Base, 17 met the 
requirements. From 77 enlisted bomber crew 
members, 17 cases were obtained. 

Fig. 1. Analytic dimensions and mean similarity of 
descriptions for the basic 49 cases 


For each case an index was computed to 
express the similarity between pairs of 
descriptions. For this purpose, a Q-type (3, 4) 
correlation coefficient was used, with an N of 
30 traits scored on a three-step scale, “doesn’t 
apply,” “omitted,” and “applies.” The forced 
distribution of trait assignments provides 
equated means and deviations for each 
description, simplifying the computations. For 
each respondent, six such coefficients were 
computed, expressing the similarity of 
description for all six pairings of four persons 
described. The relative size of these coefficients 
provides the argument of this paper. 
While in general the descriptions were favorable 
and only 9 per cent of the coefficients 
negative, sufficient range was obtained to 
make the comparisons seem appropriate. 
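
The simplification that the forced distribution affords can be made concrete. In the sketch below the three steps are coded -1, 0, and +1 (an assumed coding; any equally spaced codes give the same coefficient), so that every description has a mean of zero and a sum of squares of 16, and the correlation reduces to the sum of cross-products divided by 16. The function name is illustrative and is not taken from the study.

```python
def q_correlation(description_a, description_b):
    """Pearson r between two 30-trait descriptions coded -1 / 0 / +1.

    With 8 traits at +1, 8 at -1, and 14 at 0, each description has
    mean 0 and sum of squares 16, so r = sum(a * b) / 16.
    """
    for description in (description_a, description_b):
        if (len(description) != 30
                or description.count(1) != 8
                or description.count(-1) != 8):
            raise ValueError("each description needs 8 (+1), 8 (-1), 14 (0)")
    cross_products = sum(a * b for a, b in zip(description_a, description_b))
    return cross_products / 16


if __name__ == "__main__":
    # Hypothetical descriptions of two persons by one respondent.
    father = [1] * 8 + [0] * 14 + [-1] * 8
    boss = [1] * 4 + [0] * 14 + [1] * 4 + [-1] * 8
    print(q_correlation(father, boss))
```

Because the coefficient depends only on how many of the chosen and rejected traits coincide or conflict, it can take only a limited set of values between -1 and +1.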
The comparisons to be made are indicated 
in the four-cornered, two-dimensional schema 
shown in Fig. 1. Pairs of persons in the same 
column, or in the same row, have one of the 
four hypothesized general attitudes in common, 
and consequently should show a higher 
degree of correlation than is found for pairs 
diagonal from each other. This is not to say 
that the correlation between diagonal pairs 
represents only “true” similarity, for there 
are undoubtedly still more general attitudes 
or response sets which make all four of the 
descriptions more similar than they should be. 
A general attitude toward other men, or a 
still more general philanthropy-misanthropy 
attitude, or predilections for certain trait 
terms, all would lead to unwarranted similarity 
among descriptions. But the row and 
column pairs should have the additional similarity 
coming from the more specific attitude 
under study, and this greater similarity is the 
object of the present search. Figure 1 indicates 
the mean correlation between each of 
the possible pairs of the four descriptions for 
the 49 subjects (Ss) (computed by the z′ 
transformation). As a means of evaluating the 
significance of the differences between the 
various average values, sign tests, based upon 
the frequency one value exceeded the other 
for individual Ss, have been made. These 
comparisons are summarized in Table 1, and will 
be explained in more detail below. 
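
Averaging correlations through a transformation, as mentioned for Fig. 1, can be sketched as follows, on the assumption that the transformation meant is Fisher's z′ (the inverse hyperbolic tangent of r). Each coefficient is transformed, the transformed values are averaged, and the mean is converted back to r. The input values are hypothetical, since the individual coefficients are not reported, and the function name is illustrative.

```python
import math


def mean_correlation(coefficients):
    """Average correlations via Fisher's z' = atanh(r), then back-transform."""
    z_values = [math.atanh(r) for r in coefficients]
    return math.tanh(sum(z_values) / len(z_values))


if __name__ == "__main__":
    # Hypothetical father-boss coefficients for a few respondents.
    print(round(mean_correlation([0.75, 0.25, -0.125, 0.5]), 3))
```

The transformation stretches the scale near plus or minus 1, so that large coefficients are not under-weighted as they would be in a simple arithmetic mean of r.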


Results 


Generalized attitude toward authority. In 
terms of the analysis presented above, if there 
are generalized attitudes toward persons in 
authority, then descriptions of Father should 
correlate more highly with descriptions of 
Boss than with descriptions of Subordinate. 
The trends in the mean rs and in the tallies 
of Table 1 are in the expected direction, 
reaching the .02 level of significance. By the 
same reasoning, Boss should correlate more 
highly with Father than with Younger 
Brother. Again the findings are positive, at 
the .03 level. Although the results are con- 
firmatory, the magnitude of the correlation 
difference and the level of significance are 
probably less than one would have expected 
from such a ubiquitous hypothesis. 

Generalized attitude toward subordinates. 
If there are generalized attitudes toward sub- 
ordinates, then Younger Brother should cor- 
relate more highly with Subordinate than with 
Boss. While mean values show a slight supporting 
trend, the tally of individual trends 
is in the opposite direction. Similarly, Sub- 
ordinate should correlate more highly with 
Younger Brother than with Father. No sig- 
nificant tendency in this direction is found. 

Generalized attitude toward family. Atti- 
tude toward family should appear in Father 
correlating more highly with Younger Brother 
than with Subordinate. Similarly, Younger 
Brother should correlate more highly with 
Father than with Boss. Chance levels are ob- 
tained for both predictions. 

Generalized attitude toward current associates. 
Over-all attitudes toward the current 
work situation such as would influence the 
evaluation of persons both above and below 
in the table of organization would lead to the 
prediction that Boss would correlate with 
Subordinate more highly than Boss would 
correlate with Younger Brother. And Subordinate 
should correlate higher with Boss 
than with Father. While the means in Fig. 1 
support the hypothesis, it receives chance 
confirmation or less in the individual tallies of 
Table 1. 


Table 1 


Proportions of Comparisons of Coefficients in Expected Direction 


                                                           Bomber     Bomber 
                                                 Cadets   officers   enlisted men   Total 
Variables compared                                (15)      (17)        (17)        (49)      p 
Authority Attitude 
  Father-Boss > Father-Subordinate               11/15     12/17        7/13       30/45     .02 
  Father-Boss > Boss-Younger Brother              8/12      8/14       11/15       27/41     .03 
Subordinate Attitude 
  Younger Broth.-Sub. > Younger Broth.-Boss       3/15      7/15        7/15       17/45 
  Younger Broth.-Sub. > Subordinate-Father        5/11      6/14        8/15       19/40 
Family Attitude 
  Father-Younger Broth. > Father-Subordinate     11/15      4/13        6/15       21/43 
  Father-Younger Broth. > Younger Broth.-Boss     8/15      8/17        8/16       24/48 
Work Situation Attitude 
  Boss-Subordinate > Boss-Younger Brother         6/14      9/17       10/17       25/48 
  Boss-Subordinate > Subordinate-Father           7/14      7/16        6/15       20/45 

Note. The denominators represent the total of the cases not tied, and are thus frequently 
smaller than the total N indicated at the column heads. 


Extension of father-boss comparisons. Be- 
cause the strongest trends obtained were for 
the prediction most strongly presaged in psy- 
chological theory, it seemed appropriate to 
extend the number of cases on this point by 
including those where full symmetry of de- 
scriptive pattern was absent. From the appro- 
priate remaining cases in the three samples, 
comparisons were made for the prediction that 
Father and Boss should correlate higher than 
Father and Subordinate. This prediction was 
confirmed in 43 out of 70 untied instances, 
giving a one-tailed p value of .04. Combining 
this with the 30 confirmations out of 45 instances 
reported in Table 1 for the same comparison, 
gives an over-all score of 73/115 
which has a one-tailed p value of .003. Only 
eight additional instances were available for 
the prediction, Father correlates higher with 
Boss than does Boss with Younger Brother. 
For these, the ratio of confirmations was only 
3/8. Adding these 8 to the 41 in Table 1 gives 
a total score of 30/49, which is significant at 
only the .08 level by a one-tailed test. 
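The one-tailed probabilities quoted above can be checked with a sign test on the untied instances. The sketch below assumes an exact binomial test with a chance probability of one-half; the text does not say whether an exact test or a normal approximation was used, and the function name is illustrative only.

```python
from math import comb


def sign_test_one_tailed(confirmations: int, untied: int) -> float:
    """Probability of at least `confirmations` successes in `untied`
    trials when each trial succeeds with probability 1/2."""
    tail = sum(comb(untied, k) for k in range(confirmations, untied + 1))
    return tail / 2 ** untied


# (confirmations, untied instances) as quoted in the text.
comparisons = {
    "Father-Boss > Father-Subordinate, extension": (43, 70),
    "Father-Boss > Father-Subordinate, combined": (73, 115),
    "Father-Boss > Boss-Younger Brother, combined": (30, 49),
}

for label, (hits, n) in comparisons.items():
    p = sign_test_one_tailed(hits, n)
    print(f"{label}: {hits}/{n}, one-tailed p = {p:.3f}")
```

Under this assumption the three tallies can be compared directly with the reported values of .04, .003, and .08.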


Some other relevant comparisons are generated by 
substituting other persons for the Subordinate or 
Younger Brother, to control for the family, job situa- 
tion, and general halo factors. Thus, for cases where 
there is no Younger Brother, the description of an 
Older Brother should correlate less highly with Boss 
than does Father with Boss, although inasmuch as 
Older Brother has some authority role, the contrast 
should be less sharp. For 38 such untied instances, 
the prediction is confirmed in 23, giving a p value of 
.13. In addition, the Father-Boss correlation should 
be higher than that between Father and Peer (“one 
of your equals with whom you have closest deal- 
ings”). This prediction is confirmed in only 37 out 
of 85 untied cases (including cases from the original 
49). Not only does this finding fail to confirm the 
hypothesis, but it shows a trend in the opposite di- 
rection. It is difficult to explain this failure in the 
face of the other supporting data. 

It seemed possible that the correlation between de- 
scriptions might be a function of the degree of fa- 
vorableness of the descriptions, and that if Peers 
were judged more favorably than Bosses or Subordi- 
nates, they might correlate more highly with Father 
for that reason. To check, the descriptions of the 
basic 49 cases were scored for favorableness. Sur- 
prisingly enough, Younger Brothers were described 
most favorably, Peers least favorably, with Bosses, 
Subordinates, and Fathers between at about the 
same level. Thus degrees of favorableness cannot pro-
vide an explanation for these inconsistent data.

For middle class boys the Mother is more often
the work-supervisor than the Father. If
generalization is based upon role similarity, one might 
expect Mother-Boss correlations to be higher than 
Mother-Subordinate and Mother-Peer correlations. 
This prediction was confirmed in only 33 out of 65
instances for Mother-Subordinate, showing no rela- 
tionship. For Mother-Boss greater than Mother-Peer 
(computed only where no Subordinate was de- 
scribed) the score is only 18/48, significant in the 
unpredicted direction at the .12 level by a two- 
tailed test. A test of the expectation that Father- 
Boss would correlate higher than Mother-Boss was 
confirmed in only 59 out of 112 instances, surpris- 
ingly enough. This perhaps is an illustration of a 
relative unimportance of sex as a source of attitude 
generalization, seemingly found in a study of judg- 
ments of photographs (2). 


Summary and Discussion 


The most dependable empirical relationship 
found is for descriptions made of Father and 
Boss to be more similar than those made of 
Father and Subordinate (p < .003). This may 
tentatively be interpreted as supporting the 
notion of a common attitude toward authority 
encompassing both Father and Boss, or evi- 
dence of stimulus generalization. Supporting 
this interpretation is the tendency for Father 
and Boss to correlate more highly than Boss 
and Younger Brother (p < .08), and Boss 
and Older Brother (p < .13). Inexplicably, 
Father and Boss descriptions are no more 
similar than Father and Peer descriptions. In 
the absence of any single integrated explana- 
tion of the findings, these data are inter- 
preted as supporting, with exceptions, the 
initial hypothesis. The differences between 
correlations are, however, very small. 

The study failed to find any evidence for 
generalized attitudes toward subordinates (in- 
cluding younger brothers), family, or work 
colleagues. 


Received September 11, 1956.


Specific Behavior Changes Following Chlorpromazine


S. D. Porteus¹


Territorial Hospital, Kaneohe, Hawaii 


Though there is no dearth of studies on the 
effects of chlorpromazine with psychotic pa- 
tients, few of them report the data in quanti- 
fied form, or deal with specific alterations of 
behavior. Many investigators give the percent- 
ages of cases slightly, moderately, or markedly 
improved, but it is difficult to determine the
basis of these judgments. Under such circum- 
stances, it is possible for bias to operate, and 
in the case of the “tranquilizing” drugs, as 
with other new approaches to complex prob- 
lems, such bias unfortunately exists. Medical 
opinion seems to range from enthusiastic ap- 
proval to frank scepticism, but, without re- 
search data, either attitude is unjustifiable. 

Rather heroic efforts have been made by 
some devisers of rating scales of behavior to 
objectify their observations. Untidiness, for 
example, has been rated in terms of use of 
eating utensils or specific breakdowns in good 
toilet habits; aggressiveness in terms of num- 
ber of verbal or physical assaults, or indi- 
vidual sedations, or seclusive periods neces- 
sary. Ideally, such scales are superior to those 
that make use of generalized subjective char- 
acterizations. But, from the practical stand- 
point, they require more detailed and faithful 
reporting by psychiatric aides, or more super- 
visory checks, than the limited staffs of many 
state hospitals can afford. Moreover, the cata- 
loguing and recording of overt incidents may 
give only the bare outlines of the behavior but
none of the shading. And as we well know,
an impressionistic painting, even if somewhat
blurred, may give a far more adequate picture
than a line drawing. More concretely, an ag-
gressive patient may never strike a blow; ex-
treme depression may be often judged by what
a person does not do; and the same is true of
negativism, or withdrawal.


1 The author’s thanks are due to Smith, Kline and
French, Philadelphia, for their generous donation of
the drugs used in this study; to John E. Barclay, re-
search assistant, for invaluable help in scoring and
tabulation of results; and to Robert Kimmich, M.D.,
medical director, Robert Spencer, M.D., clinical di-
rector, and John Regan, M.D., psychiatrist, for their
advice and support in developing this research; also
to the James McKeen Cattell Fund for partial sup-
port.

Moreover, one may start out with a theo- 
retically excellent research design and a fine 
system of checks and balances, but it is an- 
other problem to keep the system in opera- 
tion. Patients drop out of the experimental 
group, supervisory personnel changes take 
place; so that the initial plans for a long-term
study must undergo alteration. We have
found it wise to keep the research tools as 
simple in operation as possible. 


Need for Specific Information 


We need specific information as to what 
behavior traits have been changed through 
drug administration, and to what degree, and 
in which directions. It would be idle to as- 
sume that all types of behavior are equally 
improved by the use of a drug such as chlorpromazine.
It is this neglect to analyze improvement
which is responsible for an incorrect
labelling of the drugs as “tranquilizing”
or “ataractic.” If it can be shown that in some 
ways the effect is stimulating rather than de- 
pressant, then the descriptive term equilibrat- 
ing or stabilizing would be more appropriate. 
Rating behavior by separate traits would seem 
to be a necessary complication of the research 
design. 

An important aspect of drug therapy is the 
time factor. It is important for both the phy- 
sician and the research worker to have more 
information as to the course of improvement, 
particularly with regard to possible plateaus
or even regressions. Such levelling-off periods
and regressions are of common occurrence in 
the learning process, and it is only reasonable 
to suppose that they also occur in physiologi- 
cal habituation. According to our experience, 
plateaus do occur when the courses of both individual
and group improvement are plotted.
They may be best considered as not only re- 
lated to physiological adjustment of patients, 
but to the suggestibility of both observers and 
patients. 

Obviously, treatment and rating of effects 
should be continued long enough to span pe- 
riods of alternating rapid and retarded im- 
provement; otherwise, the treatment may be 
too enthusiastically accepted on the basis of 
initial gains, or regarded as ineffective be- 
cause of patients’ inability to sustain improve- 
ment. The profile of progress, both individual 
and group, does exhibit these ups and downs, 
but whether they follow a regular rhythm can 
only be determined by further research. The 
effect of these fluctuations on suggestibility 
will be discussed later. 


A Ward Behavior Scale 


Our method of rating behavior changes fol- 
lowing routine chlorpromazine treatment has 
been by means of a specially devised graphic 
rating scale. The scale covers eleven traits or 
trait complexes: aggressive or destructive be- 
havior, negativism or lack of cooperation, 
speech disorders, untidiness, restlessness or 
physical over-activity, hallucinations, delu- 
sions, emotionality, mental confusion, degree 
of asocialization and compulsive behavior. 
The ratings extend from mildness or nonap- 
pearance as marked on the left of a six-inch 
line, up to very excessive appearance on the 
extreme right. 

The fact that speech disorders, emotion- 
ality, and socialization are bipolar necessitates 
the use of alternative scales, one for elation 
and one for depression in the case of emo- 
tionality, one for mutism and the other for 
loquacity in speech disorders, one for social 
withdrawal and the other for obtrusiveness 
in socialization. This avoids the difficulty of 
making ordinary behavior the midpoint, and 
then marking the bipolar extremes on either 
side, a system which would prevent the summation
of ratings into total scores. Under our
scheme, the graphic ratings can be converted
into numerical equivalents and the scores
summated.


Subjective Judgments of Behavior 


Rating scales are often down-graded in psy- 
chological estimation because of their subjec- 
tivity. Their devisers are plagued with prob- 
lems similar to those that beset the path of 
those who arrange intelligence scales. We are 
inclined to dignify the last named by calling 
them objective measures when we mean the 
measures are objectively applied. This confers 
the advantage that two examiners should get 
approximately the same results. But this 
agreement does not mean that the measures 
they apply are therefore the more valid. It is 
well to know that the two examiners may 
agree, but they may agree in being wrong. 

The basic problem lies in the fact that 
neither intelligence nor behavior is unitary; 
both have many aspects or facets. The men- 
tal tester is wont to assume a generality of 
reference for his tests that is greater than the 
evidence of validity warrants. The test de- 
viser, like the inventor of a behavior rating 
scale, assumes that all the subdivisions or 
items of his scale have equal importance. 
Wechsler, for example, states that “the best 
assumption to make about the separate tests 
of the (Wechsler-Bellevue) scale was that 
they were equally important” (5, p. 116). It 
would have been more correct to use the word 
“easiest” in place of “best.” When we add to- 
gether all the item ratings into a total index 
of behavior we are making a similarly un- 
provable or unproved assumption. 

However, it should be remembered that this 
score is not to be used like an IQ for deter- 
mining an individual’s comparative status, but 
rather for measuring changes in his behavior. 
The rating-scale approach has obvious defects, 
but like the intelligence scale it is the best 
measure at present available. Nor should it 
be forgotten that the so-called objective
tests were originally validated against subjec- 
tive judgments of mental ability. 

The descriptive terms used in our scale are 
well within the comprehension of psychiatric 
aides, but their interpretation is assisted by 
use of a guide. We have found it best to rely 
on aides’ judgments rather than on those of
more highly trained professional personnel.
Cumming and Cumming (1) have recently 
cited some evidence that aides’ predictions of 
readmissions of discharged patients are more 
reliable than those of others much higher in 
the supervisory chain of command. If in this 
respect they do better than psychiatrists, their 
more intimate social contacts with patients 
should make them better judges of specific 
ward behavior. 


Application of the Scale 


In our study, aides were given practice ses- 
sions in rating several patients and asked to 
support their judgments as far as possible by 
objective observations. Aides who betrayed a 
tendency to make stereotyped judgments as 
shown by excessive grouping of ratings at 
some particular position on the scale were not 
used in this research. It soon became appar- 
ent that certain individuals of long experi- 
ence showed exceptional balance and that 
nothing was gained by pooling their ratings 
with those of less reliable judges. In the end, 
two aides only were selected as raters though 
they often consulted with others on the ward. 

Another question which arose at the outset 
of the investigation was that of controls. In 
our research design, two closed wards for 
chronic psychotic males were selected for the 
experiment, and it was decided to give 300 mg. 
of chlorpromazine daily to all 80 patients on 
Ward H and placebos to an equal number on 
Ward I. As a matter of practical convenience, 
this plan seemed better than using half the 
patients on each ward as controls. It is so 
much easier to give the drugs or placebos 
from a single bottle three times a day, than 
to have two groups of individuals listed, and 
lined up so as to receive their pulvules from 
the right bottle. Supervision would be still 
more difficult if each patient had his indi- 
vidual bottle. 

The populations of the two wards were 
generally comparable. Patients in each ranged 
in chronological age from about 25 to over 80 
years, length of residence, between the ex- 
tremes of 2 and 57 years. Both wards con- 
tained a wide variety of psychiatric types, the 
majority being schizophrenics. Most had been 
subjected to all kinds of psychiatric treatments
without lasting benefit. They represented
all ethnic groups resident in Hawaii:
Caucasian, Japanese, Chinese, Koreans, Fili- 
pinos, Puerto Ricans. In short, the subjects 
could be considered, except from the racial 
angle, typical of the population of the “back 
wards” in any state mental hospital. Only 50 
patients were finally included in each of the 
experimental and control groups, but their 
names were chosen at random. Because so 
little was known of the limitations of the drug, 
Dr. Kimmich, the medical director, decided 
that the most democratic procedure was to 
give all the patients on Ward H the advan- 
tages, if any, of medication. Two ratings with 
a three-week interval were collected on each 
patient before the drug was administered, and 
the scores were averaged to give the premedi- 
cation rating. The drug was then adminis- 
tered over an eighteen-week period. Finally, 
to smooth out the effects of fluctuations in 
improvement, it was decided to average the 
postmedication ratings in pairs, the 3- with 
the 6-week, the 9- with the 12-week, and the 
15- with the 18-week scores. 
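The averaging scheme just described reduces eight ratings per patient to four plotted points. A minimal sketch follows; the function name and the sample scores are hypothetical and are not data from the study.

```python
from statistics import mean

# Postmedication rating weeks, paired as described in the text.
WEEK_PAIRS = [(3, 6), (9, 12), (15, 18)]


def profile_points(premedication, postmedication):
    """Return four points: the premedication average, then the averages
    of the 3/6-week, 9/12-week, and 15/18-week ratings."""
    points = [mean(premedication)]
    for first, second in WEEK_PAIRS:
        points.append(mean([postmedication[first], postmedication[second]]))
    return points


# Hypothetical total scores for one patient (illustration only).
pre = [30, 28]
post = {3: 27, 6: 25, 9: 20, 12: 18, 15: 17, 18: 15}
print(profile_points(pre, post))
```

Averaging in pairs trades some time resolution for stability, which is the stated purpose of smoothing out fluctuations in improvement.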

The double-blind system of keeping both 
aides and patients in ignorance as to which 
ward was receiving the drug and which the 
placebos was adopted, but it was not very 
long before the efficiency of any control sys- 
tem based on placebo administration came 
under serious question. In the first place, 
after the first six weeks, the comparative ab- 
sence of side effects, or of any definite trends 
as regards changed behavior on the placebo 
ward made the true situation very plain, thus 
robbing the plan of its double-blind charac- 
ter. But this is not the most serious objection 
to the placebo procedure. 


Suggestibility and the Placebo Effect 


Adopting placebo administration as part of 
the research design is intended to diminish 
the effects of suggestion. The essence of any 
control plan is that the two groups should be 
equated, but as far as we know, there is no 
way of dividing patients into groups on the 
basis of their suggestibility since we have no 
adequate measure of this trait at hand. 

Whenever an individual is given any form 
of therapy, including medication, with the 
statement that it will do him good, the expectation
thus set up tends to be fulfilled.
This fulfillment differs in degree with different
people. Moreover, it affects raters’ judgments
also. If the initial impact of the drug is
strong, their tendency is to give patients bet- 
ter ratings than are justified. If the effect lev- 
els off or subsides, then the disappointed ex- 
pectations of both patients and raters are re- 
flected in ratings lower than they should be. 
Suggestion also plays an important part when 
placebos are administered, so much so that 
their use as a method of control is futile. 
Rosenthal and Frank (3) in a discussion of 
the placebo effect point out that successes in 
almost every form of therapy depend to an 
undetermined extent on this factor. As exam- 
ples, they cite two studies in which placebos 
brought about a very marked diminution of 
distress symptoms. Only 55 per cent of cases 
reported a reduction in the incidence of com- 
mon colds after vaccine treatment as against 
61 per cent of those receiving placebos. Simi- 
lar placebo effects were reported for a variety 
of complaints, including distress from ulcers, 
surgical wounds, and chronic headaches. Un- 
favorable or even toxic side-effects have oc- 
casionally occurred with placebos. Rosenthal 
and Frank in their own study of placebo ef- 
fects found that 69 per cent of their cases 
showed decreased blood pressure, 19 per cent 
an increase, while pulse rate rose in 25 per 
cent after placebo administration. They sum 
up the implications of these results as follows: 


We need to learn more about the nature of the 
placebo effect, the conditions giving rise to it, and 
the attributes of patients most susceptible or resist- 
ant to it, so that we may obtain a better understand- 
ing of the role of the non-specific factors in psycho- 
therapy (3, p. 300). 


In default of this knowledge we have not 
plotted nor reported our placebo results ex- 
cept in one table. We would also point out 
that the method of giving one group of pa- 
tients on a ward chlorpromazine and giving 
another group nothing does not control nega- 
tive suggestion. A very common and impor- 
tant inference would be that patients who re- 
ceive no treatment would very likely get worse. 


Results 


Fig. 1. Changes in 50 patients in behavior trait
ratings, after chlorpromazine.


Figure 1 is a composite graph showing the
course of changes in 9 of 11 traits of our
scale, with the average of two premedication
ratings shown at the left. The other three
points of the graph give the averages of the
3- to 6-week, the 9- to 12-week, and the 15- 
to 18-week ratings. Two traits are not rep- 
resented, one being asocialization (withdrawal- 
obtrusiveness), which showed no change, and 
the other compulsive behavior, the course of which was
so irregular as to cast doubt on the validity 
of the ratings. The figure is divided into two 
halves so as to avoid overlapping of lines. It 
should be remembered that the points marked 
are averages of ratings for the whole group of 
50 patients, and that the higher the vertical 
position the more severe the manifestation of 
the trait in question, while the steepness of 
the slope to the right indicates degree of rated 
improvement. Zero on the graph means ab- 
sence of the trait, and as could be expected 
this is only approached in two distinctively 
psychotic behavior traits, hallucinations and 
delusions. Since these are rated inferentially 
rather than by their more overt manifestation, 
this approach to zero must be interpreted cau- 
tiously. It means disappearance of the symp- 
tom rather than nonexistence or complete 
cure. 

It is worth noting that mental confusion 
and speech disorders are relatively the least 
ameliorated by chlorpromazine and are probably
the most closely related to intellectual processes.
Here is indirect evidence that the drug
acts more on the basal ganglia or the cortico- 
thalamic arcs than on the cortex. The analogy 
between lobotomy and chlorpromazine effects, 
particularly decline in vigilance and planning 
as measured by the Porteus Maze Test, has 
recently been discussed in the pages of this 
Journal (2). Self-protective alertness seems 
thus to depend on intact corticothalamic nerve 
pathways. In passing, it may be mentioned 
that continued research at the Kaneohe Hos- 
pital confirms the discovery that serious losses 
in Maze performance accompany prolonged 
administration of chlorpromazine on a 300 
mg. daily dosage. 

The fact that changes in restlessness and 
aggressive behavior as shown by the graph 
parallel each other so closely indicates an in- 
teresting relationship. Internal stress may ex- 
press itself in either form of behavior. That 
improvement in these traits is so steady is 
one of the reasons for calling chlorpromazine 
a tranquilizer. But the fact that negativism 
and mutism are also improved shows that the 
drug stimulates as well as depresses.


Fig. 2. Group and individual profiles of behavior
rating scores.


Emotionality, which, as observed, was mainly
depression, has a rather irregular course. There 
is an initial worsening during the first six 
weeks of medication followed by a marked 
improvement in the second six weeks, with a 
tendency to decrease or level off in the third
six weeks. Negativism and untidiness follow 
similar courses at about the same psychotic 
level, which also might be expected. 

It may be that vertical height on the graph 
represents not so much severity of psychosis 
as the facility with which the traits may be 
observed by psychiatric aides. There is no 
doubt about the existence of mental confusion 
in a patient, while the overtalkative and the 
mute easily attract attention. On the other 
hand, the most difficult traits to rate are de- 
lusions and emotional states, especially de- 
pression, since their manifestations are less 
overt. The initial deepening of depression, 
like the early sharpening of delusions and
hallucinations, has been noted in general
fashion in other studies. 

Figure 2 is also a composite of four graphs. 
The first gives the over-all picture for the 
whole group as regards total scores averaged 
at the same points or time intervals as be- 
fore; the other three are individual profiles 
that seem typical of the various kinds of 
behavioral change. Case 32 exhibited steady 
and marked improvement during the first 12 
weeks, followed by a period of somewhat 
slower progress. Case 31 showed the same 
marked improvement after a slower begin- 
ning and ended in a plateau or slight regres- 
sion. Another type of change is that illus- 
trated by Case 34 who showed an initial 
regression during the first six weeks of medi- 
cation followed by a dramatic improvement 
which resulted in discharge from the hospital. 
As will be seen, the profile of Case 32 ap- 
proaches most nearly the group results. A 
fourth type characterized by initial regres- 
sion, then improvement, and later regression 
was not plotted as being too irregular to in- 
dicate any trend. 


Case Histories 


The histories of the three patients whose 
courses of improvement are charted have been 
summarized as follows: 


The hospital record of Case No. 31 states that at 
13 years of age this patient suffered head injuries
from a fall, and that thereafter his memory was
affected. In 1946-48 he was treated at other hospitals 
with 40 electric shock treatments (ESTs) recorded. 
His IQ was reported as 96. He was experiencing 
visual and auditory hallucinations and complained 
of sexual troubles. In November, 1950 he assaulted 
another patient and had ESTs at irregular intervals 
up to 1951. In February, 1956 he went on chlor- 
promazine treatment and during the experimental 18 
weeks progressed from a rating of 29 to 16 points, 
rated as moderate improvement. His Maze scores 
were successively 15½, 12½, 7, 6 years, a dramatic
decline. He is still on the drug, 300 mg. daily. In the
third six weeks of medication improvement reached 
a plateau. The most marked changes took place in 
aggressiveness and hallucination. 

Case No. 32 was admitted to the hospital in Feb- 
ruary, 1954 and for two months was put on serpasil. 
His diagnosis was schizophrenic reaction, paranoid 
type. He was reported to be destructive, untidy, ag- 
gressive, with incoherent, rambling speech, marked 
psychomotor activity, no insight nor judgment, and 
assaultive. Later in 1954 he was autistic and disor- 
ganized with delusions of grandeur. In 1956 he was 
given 63 insulin treatments with 57 comas, and 17 
ESTs. On chlorpromazine from February, 1956 he 
showed considerable improvement, being reported as 
cooperative and cheerful. From an initial score of 26 
his rating fell to 11, a change of 15 points which 
puts him in the “marked improvement” class. The 
course of decline in psychotic behavior was steady. 

Case No. 34, a veteran of Japanese ancestry, was 
hospitalized at Lebanon Hospital in 1946 with the 
diagnosis of schizophrenia, affective type. He devel- 
oped delusions of grandeur, had a persecutory com- 
plex, and was destructive and combative. The mother 
was said to be overprotective. He married a girl 
in Wisconsin, had three children, and was divorced. 
Auditory hallucinations with strong delusions and 
ideas of reference appeared at the end of 1954. He 
was given 31 insulin comas and 16 ESTs. At one 
hospital he underwent 12 ESTs and 50 insulin comas. 
The staff psychiatrist noted: “My recent observation
of the patient is that he is worse despite all the
treatment.” Thorazine treatment was begun in Oc- 
tober, 1955, and the dose was increased in June, 
1956, to 600 mg. daily. He then changed from a be- 
havior rating of 28 in February to 33 points in six 
weeks, but later insight improved, followed by a 
rapid improvement to a 2 point rating, and ultimate 
discharge. He is working as a restaurant cook. His 
premedication Maze score was 14 years; when he was
examined six months later, it had risen to 14½ years. In
spite of the rise in score his second test was some- 
what similar to some lobotomy patients. His intelli- 
gence scores at Lebanon were Wechsler-Bellevue Full 
scale IQ 93, Verbal, 97, Performance, 89. For two 
months in 1955, he was given serpasil. His improve- 
ment cannot therefore be related solely to chlor- 
promazine though it was finally the effective agent 
in recovery. Because of the unusual Maze improve- 
ment without any decline, it will be interesting to 
note his further social history. Ward behavior im- 
proved in every respect except asocialization. 


Numerical Presentation of Results 


Finally, the aides’ graphic ratings have 
been converted into numerical equivalents 
and the changes have been grouped in five- 
point divisions and presented in Table 1. To 
allow for possible rating errors, changes from 
0 to 4 points have been regarded as negligible, 
while changes from 5 to 9 points of the scale 
are considered slight or insignificant. Changes 
from 10 to 14 points are classified as moder- 
ate, 15 to 19 as marked, 20 and above as 
representing very great improvement. The av- 
erage change in all 50 cases is approximately 
12 points, so that to cut off changes up to 9 
points as being insignificant is a very con- 
servative procedure. 
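The five-point divisions described above amount to a simple lookup on the size of the change in total score. The sketch below is illustrative only: the function name is hypothetical, and treating each band as beginning at its lower bound (so that fractional changes fall into the band below the next whole boundary) is an assumption not stated in the text.

```python
# Lower bound of each band and the label the text attaches to it.
BANDS = [
    (20, "very marked"),
    (15, "marked"),
    (10, "moderate"),
    (5, "slight"),
    (0, "negligible"),
]


def classify_change(before: float, after: float):
    """Return (direction, label) for a change in total rating score.

    Higher scores mean more severe behavior, so a fall in score
    counts as a change for the better.
    """
    change = before - after
    if change > 0:
        direction = "better"
    elif change < 0:
        direction = "worse"
    else:
        direction = "unchanged"
    label = next(name for floor, name in BANDS if abs(change) >= floor)
    return direction, label


# Scores reported in the case histories: Case 31 and Case 32.
print(classify_change(29, 16))
print(classify_change(26, 11))
```

Applied to the case histories, the 13-point fall for Case 31 lands in the moderate band and the 15-point fall for Case 32 in the marked band, in agreement with the text.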

Table 1

Ward Behavior Changes Summarized

N = 50 in each group

                              Controls             Chlorpromazine Group
Rating Point Changes          Worse     Better     Worse     Better

0 to 4 (Negligible)           16.6%     16.6%      4.0%      12.0%
5 to 9 (Slight)               5.5%      4.5%                 24.0%
10 to 14 (Moderate)           5.5%      8.3%                 26.0%
15 to 19 (Marked)                       3.0%                 26.0%
20 plus (Very marked)                                        8.0%
Above 9 (Significant)         5.5%      11.3%                60.0%


On this basis, 60 per cent of these chronic
male psychotics showed significant improvement
as against 11 per cent of the placebo
group. It would be fair to say that over and 
above the placebo effect, 49 per cent, or ap- 
proximately half the group, were obviously 
improved in ward behavior after their doses
of 100 mg. of chlorpromazine three times a day (300 mg. daily),
administered over a period of 18 weeks. Considering
the mental condition and hospital history of 
the group, this must be considered a very re- 
markable result. It would certainly justify the 
opinion of Tarumianz (4, p. 92) that the 
“ataractic” drugs have ushered in a new era 
in psychiatry. 

If we abandon this conservative attitude 
and consider any decline in score as improve- 
ment, then 96 per cent of the chlorpromazine 
patients showed some degree of improvement 
as against 60 per cent of the placebo group. 
The writer, however, considers reporting in 
such terms misleading. The percentages are 
quoted merely to show perfect accord with 
the statement by Rosenthal and Frank that 
improvement in neurotics with various forms
of psychotherapy hovers around 60 per cent,
while the same percentage is reported for 
placebo effects in colds and headaches. Since 
the placebo is of so little value as a control 
device, results following its use have not been 
plotted in this study.


Use of the Semantic Differential with Lobotomized
Psychotics¹


Catherine B. Semans²
Athens State Hospital 


This study undertook to use a form of Os- 
good’s semantic differential (1, 2) with se- 
verely ill psychotics at Athens State Hospital, 
Athens, Ohio, who were candidates for trans- 
orbital lobotomy. 

Ten concepts, selected to represent areas 
presumably affected by transorbital lobotomy, 
were rotated against 15 pairs of adjectives, 
12 of which were identical with those used by 
Osgood to cover the good-bad, active-passive, 
strong-weak, and pleasant-unpleasant dimen- 
sions of meaning. Each of the resulting 150 
items was arranged with a 7-point rating 
scale. Each patient was tested individually 
and was asked to rate each concept according 
to his feelings. 


Example: ME: fast __:__:__:__:__:__:__ slow


The patient was asked to place a check mark 
to show whether he thought of himself as “very
fast,” “quite fast,” etc., to “very slow.” The 
same procedure was followed for all of the 
150 items. Time required to complete the rat- 
ings was recorded along with behavior notes. 
Half of the patients completed two ratings be- 
fore operation and one after, while the other 
half completed one rating before and two
after. Only 15 of 32 lobotomized patients were
able to complete the ratings, the others being
too withdrawn, belligerent, or ignorant to co-
operate. Seven patients who were candidates
for the operation, but whose families did not
give permission for it, formed a small control
group.


1 An extended report of this study may be ob-
tained without charge from Mrs. Catherine B. Se-
mans, Chief Psychologist, Athens State Hospital,
Athens, Ohio, or for a fee from the American Docu-
mentation Institute. Order Document No. 5184, re-
mitting $1.75 for microfilm or $2.50 for photocopies.

2 The author wishes to thank Dr. George Klare,
Ohio University, for his supervision, suggestions, and
criticisms.

It was found, first, that significantly greater 
change in concept ratings occurred when op- 
eration intervened between repetitions than 
when ratings were repeated without interven- 
ing operation. Second, time spent in complet- 
ing the rating scale was very greatly reduced 
following operation; this result suggests the 
need for caution in using time measures alone 
for indication of change following lobotomy. 
Third, atypical performance on the semantic 
differential may be related to pathological re- 
sults of lobotomy which are not immediately 
apparent clinically. Fourth, the semantic dif- 
ferential needs to be simplified for use with 
severe chronic psychotics, but can be success- 
fully used in its present form with psychotics 
in reasonably good contact as a measure of 
attitude changes. Exploratory work is being 
done in developing simplified forms. 


Brief Report. 
Received November 21, 1956. 
A Scale for Measuring Minimal Social Behavior 1


Amerigo Farina, David Arenberg, and Samuel Guskin 
Chronic deteriorated patients have been receiving
increased attention since the advent of
the tranquilizing drugs. The problem of evalu- 
ating the effects of these drugs on this type of 
patient has become acute. Recently the au- 
thors were called upon to evaluate the thera- 
peutic effects of a new drug, promazine, on 
such patients (4). The existing scales seemed 
ill-suited for use at this very low level of 
functioning. 

A scale was constructed on the assumption 
that inappropriate interpersonal behaviors are 
an important facet of serious psychiatric ill- 
ness. The items selected for this scale are in- 
tended to tap those habit patterns which are 
disrupted only in the extremes of pathology. 
It is assumed that clinical improvement will 
include the reinstitution of such simple con- 
ventional behaviors as shaking hands. A brief 
standardized format designed to look like an 
interview was selected for the administration 
of the scale. It was considered a convenient 
sample of how well the patient can deal with 
people more generally. The examiner can ad- 
minister the scale quickly and without previ- 
ous acquaintance with the patient. 


Minimal Social Behavior Scale 2


The scale is administered in a room con- 
taining a desk, two chairs, a waste paper 


1 The authors are indebted to Drs. Warren W.
Webb and Robert A. Wagoner, and to Mr. Paul B. 
Fiddleman for advice and help in completing this 
study. 

2 A guide for the administration and scoring of the 
scale and also individual item data have been pre- 
pared and are available upon request. Items 31 and 
32 were added subsequent to the promazine study. 

A 16 mm. sound training film (12 min. long) de- 
signed as an aid in administering and scoring MSBS 
is available. For further details write to senior au- 
thor. 
basket, and nothing more. The patient’s score
is the total number of items scored plus. 


1. An aide brings the patient to the open door and
introduces him. Score + if the patient enters and
approaches the examiner.

2. The examiner says, “Hello, Mr. ____.” Score +
any discriminable response to the greeting, verbal or
otherwise.

3. Score + if the response to the greeting is verbal
and appropriate.

4. The examiner extends his hand. Score + if the
patient shakes hands.

5. The examiner says, “Won’t you have a seat?”
Score + if the patient sits without further urging.

6. The examiner says, “How are you today?” Score
+ any discriminable response to the question, verbal
or otherwise.

7. Score + if the response to the question is verbal
and appropriate.

8. The examiner drops a pencil by pushing it off
the edge of the desk, ostensibly by accident. If the
patient does not pick up the pencil spontaneously,
the examiner says, “Will you please pick up that
pencil for me?” Score + if the patient picks up the
pencil.

9. Score + if the patient picks up the pencil
spontaneously.

10. The examiner says, “Would you mind moving
your chair closer?” Score + if the patient moves the
chair closer to the examiner.

11. The examiner holds in front of the patient a
drawing of a three-inch square with diagonals. The
examiner says, “I have something I want to show
you.” Score + if the patient looks at the drawing.

12. The examiner offers a pencil to the patient and
says, “Here is a pencil.” Score + if the patient
accepts the pencil without further urging.

13. The examiner places a pad of blank paper in
front of the patient and says, “I would like you to
copy this drawing on this paper.” Score + if the
patient makes any mark on the paper.

14. Score + if the patient draws any four-sided
figure with diagonals and nothing more.

15. The examiner proffers an opened pack of
cigarettes to the patient and says, “Cigarette?” Score +
any response which indicates acceptance or refusal.

16. The examiner says, “How are you getting
along?” Score + any recognizable response to the
question, verbal or otherwise.

17. Score + an appropriate verbal response to the
question. 

18. The examiner crumples a sheet of paper, tosses 
it at the waste basket, purposely missing, and says, 
“Damn it, missed again.” Score + if the patient
responds with a smile or a laugh.

19. Score + if the patient spontaneously picks up 
the paper and deposits it in the waste basket. 

20. Score + if the patient makes any verbal
response, irrespective of content, to all of the
following questions:


(a) What year is it? 

(b) What month is it? 

(c) What day is it? 

(d) What season is it? 

(e) Where are you? 

(f) What is the nearest city? 

(g) Who are some people you know around here? 
(h) Do you hear voices now?

(i) Is anybody making trouble for you? 


21. The examiner places a magazine, such as Life, 
in front of the patient and says, “I'll be busy a 
minute.” The examiner busies himself with paper 
work. Score + if the patient turns at least one page 
of the magazine. 

22. The examiner feigns a headache by rubbing 
his head with his hands, assuming a pained expres- 
sion and ostensibly attempting to shake off the pain. 
Score + any verbal response which includes the con- 
tent of “head” or “pain.” 

23. The examiner places a cigarette in his lips and 
fumbles for matches. He stands up and pats his 
pockets. A book of matches has previously been 
placed within easy reach of the patient. Score + if 
the patient offers or calls attention to the matches 
or offers a light from his own cigarette. 

24. The examiner rises and offers his hand, saying, 
“Thank you very much, Mr. ____.” Score + if the
patient rises from his chair. 





25. If necessary, the examiner says, “Go ahead, 
Mr. ____,” indicating the door. Score + if the
patient opens the door and crosses the threshold
without further urging.





26. Score + unless inappropriate grimaces or man- 
nerisms are readily apparent. 

27. Score + if the patient at any time looks the 
examiner in the eye. 

28. Score + unless the patient obviously appears 
to avoid the examiner’s gaze at any time, or stares 
at the examiner fixedly. 

29. Score + unless the patient sits in a bizarre 
position, is in constant motion, or is nearly motion- 
less. Do not confuse with item 26. 

30. Score + unless the patient’s clothes are obvi- 
ously disarranged, unbuttoned, or misbuttoned. 

31. Score + unless the patient is obviously drool- 
ing or nasal mucus is clearly visible or unless food 
deposits are conspicuous on clothes or face. 

32. Score + unless the patient rises from his seat 
and moves away from the examiner before the ter- 
mination of the interview without an explanation. 
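
The scoring rule stated above (the score is the total number of items scored plus) can be sketched as follows. The function name and the example record are hypothetical. The sketch assumes the 32-item form; the promazine study used the earlier 30-item form.

```python
N_ITEMS = 32  # items 1-32 as listed above (assumes the 32-item form)


def msbs_score(plus_items):
    """Return the MSBS total: the number of distinct items scored plus."""
    plus = set(plus_items)
    bad = [i for i in plus if not 1 <= i <= N_ITEMS]
    if bad:
        raise ValueError(f"unknown item numbers: {sorted(bad)}")
    return len(plus)


# Hypothetical record: the patient entered, responded to the greeting,
# shook hands, sat down, and looked the examiner in the eye.
example_record = [1, 2, 4, 5, 27]
print(msbs_score(example_record))
```

Because every item is scored simply plus or not plus, the possible totals run from 0 to the number of items administered.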


In the previous study (4), a high degree of 
agreement was obtained between two raters 
scoring the same sample of behavior. With 
the examiner interviewing the patient and an 
observer simultaneously rating from behind a 
one-way screen the correlation was .96 (N =
15). A test-retest correlation of .87 (N = 13)
was obtained when an examiner retested the 
same patients after a seven-week interval. Also, 
the scale was able to detect, at a statistically 
significant level, changes which, presumably, 
resulted from the treatment program. The 
thirty items were examined for internal con- 
sistency; none was found to be in disagree- 
ment with the total score. The data from the 
promazine study, then, indicated high inter- 
rater agreement and stable test-retest scores, 
and suggested that minimal changes were 
measurable using this scale. 

This paper is a report of a more thorough 
investigation of the scale itself. The primary 
aims were to obtain an indication of the va- 
lidity and particularly to determine its reli- 
ability with patients functioning at a low 
level of adjustment. A further aim was to 
compare the reliability obtained with such 
patients with that of another scale in current 
use. The scale selected for this purpose was 
the Hospital Adjustment Scale (HAS). The 
HAS is an inventory listing ninety ward be- 
haviors. It is designed to be filled out by aides 
on the basis of prolonged contact with the pa- 
tients. 


Method and Procedure 


Four groups of male patients were selected 
in order of increasing pathology. Groups A, B, 
and C were all from one ward housing chronic 
patients and were selected by the ward psy- 
chologist. Group D was selected from a ward 
where the patients were so regressed that they 
required considerable nursing care. Group A 
(N = 5) was composed of patients consid- 
ered nearly ready to go home. Group B (N 
= 15) consisted of more seriously ill patients. 
Group C (N = 20) was composed of the least 
adequately functioning patients on the ward. 
Group D (N = 20) was selected from another 
ward where, as noted, the patients were even 
less able to care for themselves. Pertinent in- 


formation about the groups is summarized in 
Table 1. 
Table 1


Summary of Pertinent Information About
Patient Groups


                            Grades         Years current
              Age          completed      hospitalization

Group     Mean    SD     Mean    SD       Mean          SD
A         28.3    5.4    13.8    3.1      1.5           1.2
B         34.9    5.9     9.5    2.9      [illegible]   [illegible]
C         35.0    5.7     7.8    3.8      9.3           5.4
D         37.2    4.3     7.7    2.6      9.6           4.4


The Minimal Social Behavior Scale (MSBS) 
and the HAS were both administered to all 
four groups. In addition both scales were re- 
administered approximately one week later on 
groups B, C, and D. For each patient differ- 
ent interviewers were involved in the read- 
ministration of the MSBS, and different aides 
filled out the second HAS. Furthermore, in 
the second MSBS interview groups B and C 
were also rated simultaneously by an observer 
from behind a one-way screen. 


Results and Discussion 


The correlation between the interviewer 
scores and the scores obtained by the observer 
from behind the one-way screen was .95 (N 
= 35). This agrees closely with the correlation
of .96 previously obtained with the 30-item
scale (N = 15). Excellent interrater
agreement is indicated.

Correlations between different interviewers 
rating the same patients one week apart are 
given in Table 2. Comparable correlations ob- 
tained by different aides independently rating 
the same patients with the HAS are also listed. 
(Group A was rated only once.) The lowest 
test-retest correlation for each scale occurred 
for group D. It was expected that the HAS 
would be less well suited for patients func- 


Table 2


Interexaminer Correlations


Group     MSBS     HAS
B         .80      .61
C         .84      .84*
D         .63      .42*


* Difference significant at the .05 level.


Table 3


Group D Mean Scores and Differences


               MSBS                HAS
Raters 1       18.50               26.70
Raters 2       18.80               16.05
Difference       .30               10.65
t                .34                3.58
p              Not significant       .01


tioning at a very low level, and this seems to 
be the case. The lower reliability obtained
with the MSBS on these subjects might suggest
that it, too, is a less effective measure at
this level. However, as noted above, in the
previous study (4) a reliability test-retest
with a similarly pathological group yielded a
correlation coefficient of .87 (N = 13). These
differences between the MSBS and the HAS 
correlations are not statistically significant 
with samples of this size. However, the data 
suggest that the MSBS may be more reliable 
with patients functioning at this low level. A 
question about the suitability of the HAS at 
this level is raised by the disparity between 
the two means obtained on group D by the 
two aides (see Table 3). The difference is sig- 
nificant at the .01 level. In contrast to this 
the difference in mean scores obtained by the 
MSBS raters is minimal. At this low level of 
adjustment the MSBS seems to be a more re- 
liable instrument than the HAS. 

The means for the initial ratings on the 
four groups are given in Table 4. An analysis 
of variance among these four groups resulted 
in an F ratio of 9.07, which indicates that 
these groups differed significantly at the .001 
level of confidence. Furthermore, the means 


Table 4


Means for Initial MSBS Ratings


Group     N     Mean
A          5    29.8
B         15    24.6
C         20    21.9
D         20    18.6


decreased from groups A through D as predicted,
and this sequence would occur by
chance less than five per cent of the time. 
This indicates that the scale is capable of 
discriminating various degrees of pathological 
adjustment on a group basis. In general, the 
data indicate that the MSBS may have some 
utility when used with patients who are func- 
tioning at a low level of adjustment. 
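
One plausible reading of the "less than five per cent" figure, offered as an assumption about the reasoning rather than as the authors' stated computation: if the four group means bore no relation to pathology, each of the 4! = 24 possible orderings would be equally likely, so the single predicted ordering would arise by chance with probability 1/24. The sketch below checks that arithmetic against the Table 4 means.

```python
from fractions import Fraction
from itertools import permutations

# Initial MSBS means from Table 4.
means = {"A": 29.8, "B": 24.6, "C": 21.9, "D": 18.6}
predicted_order = ("A", "B", "C", "D")  # decreasing means, A through D

# The observed ordering of groups by decreasing mean.
observed_order = tuple(sorted(means, key=means.get, reverse=True))

# Under the assumption that all orderings are equally likely,
# exactly one of the 24 matches the prediction.
orderings = list(permutations(means))
p_chance = Fraction(1, len(orderings))

print(observed_order == predicted_order)
print(len(orderings), float(p_chance))
```

Under that assumption the chance probability is about .042, which falls below the five per cent level.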


Summary and Conclusions 


A scale was constructed to measure social 
behavior in chronic, deteriorated patients. It 
was administered to four groups of patients; 
the groups represented four different levels of 
psychopathology. The scale differentiated the 
four groups significantly and the means de- 
creased as hypothesized. Some comparisons 
between the MSBS and the HAS were made 
with the group at the lowest level of adjust- 
ment. With this group the correlations be- 
tween repeated measures seemed to favor the 


MSBS. The two aides rating this group with 
the HAS obtained significantly different mean 
scores. The mean MSBS scores were much 
less disparate. The scale appears to have some 
utility when used with patients at a low level 
of adjustment. 


Received August 20, 1956. 
Observer Reliability of Interaction Patterns
During Interviews 1


Jeanne S. Phillips, Joseph D. Matarazzo, Ruth G. Matarazzo, 
and George Saslow * 
Massachusetts General Hospital and Harvard Medical School 


This report is the fourth in a series of stud- 
ies (9, 12, 13) dealing with the interview as 
an instrument for personality research. Much 
previous research with the interview has shown 
that, while it yields rich psychodynamic and 
psychotherapeutic material for some types of 
investigation, it is, when entirely uncontrolled, 
nevertheless severely limited as a research 
tool for the study of behavior. Many investi- 
gators have reported little or no agreement 
between two or more interviewers when the 
same sample of subjects was interviewed and 
each individual was rated on predefined vari- 
ables such as anxiety (4), various types of 
defenses (11), or specific diagnosis (1, 7), 
etc. In view of the methodological shortcom- 
ings of the interview which these reports have 
highlighted, our group began to work with 
both a partially standardized interview (8), 
and certain well-defined interview-interaction 
variables, as measured by the Interaction 
Chronograph method. 

The partially standardized interview in- 
volves certain “rules” for the interviewer to 
follow during each of five predefined subpe- 
riods of the interview. Periods 1, 3, and 5 
consist of free give-and-take interviewing, 
while Periods 2 (silence) and 4 (interrup- 
tion) involve stress phases of the interview. 
The principal variables recorded during the 
standardized interview are listed in Table 1. 
It is to be noted that there is no standardiza- 
tion of content in these interviews. 


1 This investigation was supported by a research 
grant (M-1107) from the National Institute of Men- 
tal Health, of the National Institutes of Health, 
U. S. Public Health Service.

*Now at University of Oregon Medical School. 


The recorded variables include the number 
of interactions of both the subject (S) and 
interviewer; the duration of each action and 
each silence; the adjustment of the partici- 
pants to each other; the frequency of the S’s 
initiative-taking during the silence stress pe- 
riod; the frequency and duration of interrup- 
tions; the frequency of dominances and sub- 
missions, etc. A more complete description of 
the standardized interview, definitions of the 
above-mentioned variables, and a history of 
the development and underlying theory of the 
Interaction Chronograph method will be found 
in previous reports (8, 9, 10, 12, 13). 

To date our research has shown the follow- 
ing. First, there are wide individual differ- 
ences in interaction patterns among Ss. Sec- 
ond, the interviewee interaction variables for 
any given subject are highly stable across 
two different interviewers when the latter 
standardize their interviewing behavior (with- 
out standardizing the content of their inter- 
views), and at the same time, the variables 
are modifiable by planned changes in the 
intra-interview behavior of either interviewer 
(12). Third, the striking general stability and 
specific modifiability of interviewee interac- 
tion patterns which were found for our first 
sample of Ss could be cross validated in a 
second sample (9). Fourth, the stability and 
modifiability were equally striking when only 
a single interviewer was used and the test-re- 
test interval was extended to seven days (13), 
in contrast to the first two studies which em- 
ployed a test-retest interval of a few minutes. 

Before we proceeded with studies designed 
to investigate the “meaning,” or predictive, 
concurrent, and construct validity, of the Interaction
Chronograph patterns, it became apparent
that there was an important methodological
question which the previous studies
had not fully answered; namely, what is the 
influence of the observer on the final inter- 
action pattern recorded for any given sub- 
ject? Although the previous three studies had 
yielded high test-retest reliability (stability) 
coefficients for the interaction variables, the 
fact that the reliability coefficients depend 
upon the input of the observer recording the 
two-person interactions raises the question of 
confounding. That is, for any given subject, 
the final interaction description yielded by the 
Interaction Chronograph method may be an 
accurate portrayal of the S’s “true” interac- 
tion during the interview, or it may also, to 
an unknown degree, reflect decisions, biases 
or other response sets of the particular observer.
Other currently used interaction systems (6)
which utilize a human observer during
the actual ongoing interview or group
interaction (in contrast to later analysis of 
typed transcripts), have given little attention 
to the important methodological question of 
the observer’s response sets. 

The present study was designed to investi- 
gate the reliability of the Interaction Chrono- 
graph observer’s recording. Other investigators 
working with the method, notably Chapple 
(2, p. 301) and Goldman-Eisler (5, p. 355), 
have recognized the importance of the ob- 
server’s input to the final interactional de- 
scription of the subject. However, no system- 
atic study of the observer has yet been 
reported. Since only one observer had been 
used in all our observations, the question of 
possible minimizing of interviewee variability 
through the observer’s own constant response 
sets must be examined. 


Procedure 


The availability of two Interaction Chrono- 
graphs in the personnel department of a large 
department store made possible simultaneous 
but independent recordings of the same standardized
interview by two observers (Os). 3


3 The authors wish to express their appreciation to
the members of the Personnel Department of Carson- 
Pirie-Scott, of Chicago; especially to its head, Miss 
Elizabeth Hatch, for her cooperation and assistance, 
and to Miss Louise Mistlebauer, who served both as 
coordinator and observer in this study. 




One of the Os had had approximately two 
years of experience (involving many hundreds 
of interviews) observing the standardized in- 
terview in an employment setting. The second 
O was relatively inexperienced, having re- 
corded only some 10 practice interviews, and 
these in a psychiatric rather than a depart- 
ment store setting. Preparation of the two Os 
for the present study consisted of their reviewing
the structure of the partially standardized
interview and the rules for what constitutes
scorable action and inaction. Experience
has indicated that a major difficulty
arises for the O when he tries to decide when 
an interviewee communication unit (action) 
has terminated and inaction has begun. In 
order to surmount this difficulty and thereby 
make the observations more objective, certain 
rules have been established to aid in the de- 
cision of what is scored as inaction. These 
rules have been published (3, pp. 5—9; 8, pp. 
362-364) and were reviewed by the two Os 
before the study began. 

Following the joint review, standardized in- 
terviews were conducted by three experienced 
interviewers over a one-week period with 
seventeen randomly selected Ss. The seven- 
teen Ss, who were being routinely interviewed 
and evaluated by the personnel department, 
consisted of applicants for jobs and employees 
being considered for promotion. The inter- 
views were simultaneously but independently 
recorded by the two Os who sat in a small, 
totally dark room and watched the interview 
through a one-way window. An intercom- 
munication system was used to transmit the 
voices in all but the first three interviews, 
when mechanical failure forced the Os to use 
visual cues alone. Because of the darkness of 
the Os’ room, the use of earphones, and the 
Os’ distance from the recording machines, no 
visual or auditory cues were available to 
either O to indicate the other’s recording, as- 
suring independence of the two sets of ob- 
servations. 


Results 


Table 1 contains the means and sigmas, as 
well as the Spearman rho and Pearson r re- 
liability coefficients for the nine major Inter- 
action Chronograph variables. Although mean 
values (i.e., raw scores divided by number of 
units for each S) are usually used as scores
for individual Ss, it was felt that these might
obscure differences between the results of the 
two Os. Therefore, individual raw scores were 
used in the computations. Mean values for the 
17 Ss (i.e., the means across the 17 individual
raw scores) are presented for the total
interview in Table 1. The reason for present- 
ing values for both rho and r in Table 1 is 
made clear by the results with the variable 
S’s Adjustment. For this variable, due to the 
influence of only 3 cases, r is reduced to an 
insignificant value of .398, while rho, a meas- 


Table 1


Observer Reliability for Total Interview


                   Mean
                   raw
Variable           score      SD      rho      r       p
S’s Units
  Obs. X           50.70    18.92    .962    .985    .01
  Obs. Y           50.70    18.79
S’s Action
  Obs. X           41.53     8.47    .998    .998    .01
  Obs. Y           41.69     8.51
S’s Silence
  Obs. X            4.34     2.31    .910    .980    .01
  Obs. Y            4.16     2.14
S’s Tempo
  Obs. X           45.84     7.42   1.000    .999    .01
  Obs. Y           45.85     7.36
S’s Activity
  Obs. X           37.22     9.90    .998    .996    .01
  Obs. Y           37.53    10.00
S’s Adjust.
  Obs. X            -.82    14.25    .710    .398    .01*
  Obs. Y           -5.41    22.78
Int.’s Adjust.
  Obs. X          -85.41    76.03    .859    .944    .01
  Obs. Y          -73.94    65.10
S’s Synch.
  Obs. X           41.59    17.10    .928    .948    .01
  Obs. Y           32.53    14.44
Int.’s Units
  Obs. X           47.24    17.26    .993    .999    .01
  Obs. Y           47.35    17.55


* For this variable rho is significant at the .01 level, while r
is not significant due to 3 deviant cases.

Note: “S” is Subject, “Int.” is Interviewer, “Obs.” is Observer.
Table 2

Observer Reliability for Individual Periods

            S’s Units        S’s Action       S’s Silence
Period     rho      r       rho      r       rho      r
1         1.000   .998     .958    .967     .910    .952
2          .476   .889     .988    .997     .855    .911
3          .972   .973     .946    .970     .781    .840
4          .710   .787     .994    .999     .763    .940
5          .984   .984     .892    .878     .901    .956

p = .05, r = .482, rho = .49.
p = .01, r = .606, rho = .64.


ure which is less sensitive to extreme devia- 
tions, yields a highly significant value of .710. 
Since we have been dealing with interaction 
measures the characteristics of which we are 
only now beginning to determine, and with 
relatively small Ns, we have felt it wise to 
compute both r and rho in our various analy- 
ses. Taken together, the results in Table 1 are 
striking evidence that, even with an inex- 
perienced observer, recordings of Interaction 
Chronograph patterns during standardized in- 
terviews are very reliable. For r, eight of 
the nine variables have reliability coefficients 
above .94, while six are above .98. The results 
are equally striking for rho. 
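
The contrast drawn above between r and rho can be illustrated with a short sketch. The scores below are hypothetical and are not the study's data: 14 of the 17 pairs agree exactly and 3 deviate sharply in size while barely changing rank order. Both functions are plain-Python implementations written for this illustration. Spearman's rho is computed as Pearson's r on ranks, with tied values given their average rank.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Rank-order correlation: Pearson r applied to average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical scores for 17 Ss from two observers.
obs_x = list(range(1, 18))
obs_y = list(range(1, 15)) + [60, 20, 90]  # three deviant cases

print(round(pearson_r(obs_x, obs_y), 3))
print(round(spearman_rho(obs_x, obs_y), 3))
```

In data of this shape the rank-order coefficient stays near unity while the product-moment coefficient drops, which is the pattern reported for S’s Adjustment.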

Since the standardized interview consists of 
five subperiods, it is of interest to ask how 
reliable are the observations for these pe- 
riods in contrast to the interview as a whole. 
Table 2 presents the subperiod observer reli- 
ability for a sample of three of the interac- 
tion variables: the number of S’s Units (ac- 
tions), the duration of S’s Actions, and the 
duration of S’s Silences. The values of r 
within subperiods for these three variables 
range from .787 to .999, despite the fact that 
each is based on only a small time sample of 
the total interview. The one relatively low 
value of rho, .476, for S’s Units in Period 2, 
is a statistical artifact due to a number of 
tied ranks, as can be seen by the high value 
(.889) of the Pearson r for this same vari- 
able. Of the 45 period-variable combinations 
(9 variables times 5 subperiods, of which 15 
are shown in Table 2), 10 of the Pearson r 
observer-reliability values were .99; 20 were 
.95 and above; 28 were .90 and above; and 
Table 3


Observer Reliability for Initiative, Dominance,
and Quickness


                       Mean
                       raw
Variable               score       SD      rho      r       p
S’s Initiative (2)
  Obs. X                 7.47     2.81    .765    .877    .01
  Obs. Y                 8.53     3.39
S’s Dominance (4)
  Obs. X                 3.18     6.66    .532    .588    .05
  Obs. Y                 6.59     7.37
S’s Quickness (2)
  Obs. X              -127.59    60.04    .941    .954    .01
  Obs. Y              -128.82    55.34


40 (89%) were .70 and above. 4 Similar values
were found for rho. Only one variable, S’s 
Adjustment, Periods 2 and 3, was found to be 
unreliable. These two instances of subperiod 
unreliability appear to be due in part to the 
very restricted number of observations rele- 
vant to S’s Adjustment which occur in Periods 
2 and 3. Since S can fail to adjust (interrupt 
or fail to respond) to the interviewer only 
when the interviewer himself has acted, the 
occurrence of approximately 3 and 5 inter- 
viewer’s Units in Periods 2 and 3, respec- 
tively, meant that S’s Adjustment in these 
periods depended upon very few—3 and 5— 
observations. Therefore, relatively small dif- 
ferences in observing one unit of adjustive 
behavior out of the three instances led to un- 
reliability between Os for these 2 subperiods 
for this variable. Since these were the only 
two instances of unreliability, it can be con- 
cluded that observer reliability is high for in- 
dividual subperiods as well as for the inter- 
view as a whole. 

Table 3 presents the reliability results for 
the three variables which are scored only in 
the stress periods, 2 and 4, of the standardized
interview. These variables are S’s Initiative,
S’s Dominance, and S’s Quickness. 5


4 To save printing costs, these 45 period-variable
correlations have been deposited with the American
Documentation Institute. Order Document No. 5183,
remitting $1.25 for microfilm or $1.25 for photocopies.

These variables, like those in Table 1, are usu- 
ally divided by the number of S’s Units and 
hence express average frequency or duration 
per unit. The individual raw scores were uti- 
lized in the present study, however, as stated 
earlier. It is clear from the values shown in 
Table 3 that, despite the fact that these three 
variables are derived from only a small sam- 
ple of the total interview, the Os nevertheless 
attained considerable reliability (.01 level of 
confidence for S’s Initiative and S’s Quick- 
ness, and .05 level for S’s Dominance). The 
finding of significant but lower observer reli- 
ability for the Period 4 S’s Dominance vari- 
able would seem to support our earlier hy- 
pothesis (12, p. 427) that the “fast pace” of 
Period 4, with both the S and interviewer 
talking at the same time, presents the O with 
the most difficult recording situation. Refer- 
ence to Table 2 of the present study sheds 
further light on this possibility. In this table 
the Pearson r correlations for S’s Action and
S’s Silence in Period 4 are extremely high 
(.999 and .940, respectively), while the r for 
S’s Units is somewhat lower (.787) but still 
at the .01 level. Such results suggest that the 
Os differ by only several hundredths of a 
minute in recording how long an S speaks 
and is silent in Period 4, but that the ex- 
tremely small differences in duration occa- 
sionally result in observer disagreements as to 
whether S stopped acting before or after the 
interviewer stopped acting and thus some- 
what reduce observer reliability for S’s Domi- 
nance. Likewise, the very small discrepancies 
in observer input for the duration measures 
may occasionally result in differences as to 
whether S stopped very briefly and then be- 
gan a new unit or was continuously acting in 


5 Initiative: the number of times, out of the avail- 
able number of opportunities (usually 12) in Period 
2, in which S acted again following S’s own last ac- 
tion. Dominance: the number of times in Period 4 
that S “talked down” the interviewer minus the 
number of times the interviewer talked down S. 
Quickness: the length of time in Period 2 that S
waited before taking the initiative following his own 
last action. Quickness is routinely scored in depart- 
ment store applications (employee selection) of the 
Interaction Chronograph and is thus included here 
despite the fact that we have not heretofore used it 
in our own research on interaction patterns during 
interviews. 
one unit. As is shown in Table 2, however,
such minor differences in observer-input for 
duration measures apparently have little effect 
on the reliability of some of the variables (S’s 
Units, S’s Silence, S’s Action), even in Period 
4, although they may result in more serious 
differences in the scores obtained for S’s Domi- 
nance (Table 3). However, the fact that S’s 
Dominance could yield an observer reliability 
coefficient at the .05 level of confidence de- 
spite the fast pace of Period 4 implies that, 
with further refinements in definition, and 
possibly more observer practice in Period 4 
interactions, this variable may also be as reli- 
ably observed and recorded as the others. 
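
The significance levels quoted for the observer-reliability coefficients can be checked roughly by converting a Pearson r to a t statistic with n − 2 degrees of freedom. The sketch below is illustrative only. It assumes that the 17 interviews mentioned in the Summary are the n behind the Period 4 coefficients; the function name and the tabled critical value are supplied here and are not taken from the study.

```python
import math


def r_to_t(r: float, n: int) -> float:
    """Convert a Pearson r based on n paired observations to t with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Assumption: n = 17 interviews; r = .787 is the Period 4 value for S's Units.
N_INTERVIEWS = 17
T_CRIT_01_DF15 = 2.947  # two-tailed .01 critical value of t for 15 df (standard table)

t_units = r_to_t(0.787, N_INTERVIEWS)
print(round(t_units, 2), t_units > T_CRIT_01_DF15)  # prints: 4.94 True
```

Under these assumptions an r of .787 comfortably exceeds the .01 criterion, which agrees with the level stated in the text.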


Discussion 


Considering the study as a whole, it is clear 
from the results presented that, with the pos- 
sible exception of the S’s Dominance variable, 
the observation and recording of interaction 
patterns during the partially standardized in- 
terview is a highly reliable undertaking. The 
unusually high coefficients of correlation for 
the total interview (Table 1) imply that the 
observer’s task is largely a mechanical one 
once he has read, understood, and practiced 
the published rules as to what constitutes an 
action and an inaction (3, pp. 5-9; 8, pp. 
362-364). Observer response-sets or biases 
appear to have little effect upon the inter- 
viewee interaction record finally obtained. 

The high observer-reliability results, by 
themselves, do not fully answer a second ques- 
tion which motivated the present study: Was 
the high test-retest stability found earlier (9, 
12, 13) for interviewee interaction patterns 
under conditions of the standardized inter- 
view a function of “true” invariance and 
predictable modifiability in interviewee char- 
acteristics, or were these earlier findings in- 
fluenced by the memory of the observer, his 
input in retest being influenced by his mem- 
ory of input in original test? In order to an- 
swer this question better, it seemed that a 
design using one observer for the test series 
and a different observer for the retest series 
would have been superior to the one we used 
(12). However, we were interested, in our 
initial studies, in varying interviewers and 
could not, within the limits of the design 
chosen, also vary observers. The latter, we
felt, had to be controlled while we studied
our main dependent variable, a single inter- 
viewee’s interaction patterns across different 
interviewers. 

Despite the fact that a somewhat better 
design exists for testing the possible influence 
of the observer's memory on our previous 
findings, the results of the present study carry 
strong implications which help us to assess 
this potential influence. There are several rea- 
sons why the present results tend to rule 
out any significant influence of the observer's 
memory on the high reliability coefficients of 
the original studies. First, and most impor- 
tant, the results of the present observer-reli- 
ability study suggest that, neither in the case 
of an experienced observer nor an inexperi- 
enced one did observer response sets influ- 
ence the final record of any given S’s inter- 
action. In other words, as mentioned earlier, 
the task of the observer is a more or less 
mechanical one, so that extraneous response 
sets seem to have minimal influence on the 
final record. Second, the interaction variables 
which the observer records during a complex, 
live, two-person interview are swift moving, 
so that, even for a single S, it is hard to see 
how, during initial test, the observer could 
ascertain or remember such facts as, for ex-
ample, that S had 87 units, each one on the
average .57 of a minute in duration; his si-
lences averaged .09 of a minute; his malad-
justments to the interviewer took the form
primarily of interrupting him, doing this
about 24 times in Period 1, each for an av-
erage duration of .08 of a minute; he failed
to synchronize 43 times in all; he took the 
initiative 6 times out of 12 in Period 2 and 
submitted 3 times more often than he domi- 
nated in Period 4; he talked on the average 
.46 of a minute per unit in Period 3, but .72
of a minute per unit in Period 5; etc. If an 
observer were capable of absorbing all this in- 
formation merely from observing an S in the 
initial interview (the observer never saw the 
machine record), which it is our belief he is 
not, he would still have to translate such data 
in all their complexity accurately from recall 
while the retest interview was going on, in 
order to ascribe the earlier findings of sta- 
bility to his memory rather than ascribing 
them to patient invariance. Data on human
learning would suggest that events as compli-
cated as those we have been recording are not 
learned and recalled (even five minutes later) 
by a human observer with the high degree of 
reliability found in our first two test-retest 
studies. Third, even if it were possible for an 
observer to learn and remember the complex 
events investigated in the first two studies 
until retest a few minutes later, it is hardly 
likely he could have remembered such a com- 
plicated set of events over a seven-day pe- 
riod, as in our third study (13). This is espe- 
cially the case since he might have observed 
as many as 6 or 8 other Ss’ interviews in the 
seven days between test and retest. The in- 
fluence of both proactive and retroactive in- 
hibition would certainly tend to cancel out 
whatever memory “traces” the observer might 
have had. 

In view of these considerations and the re- 
sults of the present study, it seems far more 
probable to us that the role of the observer 
is a mechanical one, and that the stability of 
interviewee interaction patterns which we have 
previously reported is a “true” characteristic 
of interviewee behavior, and not an artifact 
of the method of observation. 

With the completion of the present study 
it is our belief that all major aspects of the 
reliability of the Interaction Chronograph
variables have been investigated. Thus to date 
we have studied the reliability of the inter- 
viewer who serves as the independent variable 
by following the rules of the partially stand- 
ardized interview (9, 12); the reliability of 
the interviewee interaction patterns, the de- 
pendent variables (9, 12, 13); the reliability 
of the scorer who scores the final Interaction 
Chronograph record (12, p. 429); and, finally, 
the reliability of the observer’s input (the 
present study). Having answered the ques- 
tions relevant to reliability, we have since 
turned to the question of the validity of the 
interviewee interaction patterns. 


Summary 


The design of our previous investigations 
made it impossible for us to study the pos- 
sible role of the observer in accounting for the 
high interviewee stability coefficients which 
we obtained in a partially standardized inter- 
view. Results of the present study, utilizing
one highly experienced observer and another
observer with only minimal experience, inde- 
pendently and simultaneously observing the 
same 17 interviews, indicate that the role of 
the observer is a highly reliable, almost me- 
chanical one. For the interview as a whole, 
8 of the 9 major interaction variables yielded 
observer reliability coefficients (Pearson rs) 
above .94. Equally high values were found for 
many of the five subperiods of the interview, 
while only one variable, S’s Adjustment, was 
found to be unreliable in any subperiod. The 
variable S’s Dominance in the fourth (inter-
ruption) period showed only moderate reliability
(.05 level of confidence), in contrast to the high
reliability of the other major variables. With
the present demonstration of observer reli- 
ability completed, and the earlier demonstra- 
tions of interviewer, interviewee, and scorer 
reliabilities, we plan now to devote further 
research efforts to the question of the validity 
of the interview interaction variables. 


Received March 15, 1957. 
Early Publication.


Manifest Anxiety and Test Taking Distortion
of the Blind¹


Sidney I. Dean 


University of Portland 


An earlier study of blind subjects (1) in- 
dicated that the MMPI mean score psycho- 
graphs for both sexes yielded patterns within 
normal deviations on all clinical scales. This 
study utilizes two measures derived from the 
MMPI. The blind are said to be an anxious 
group; the Taylor Manifest Anxiety Scale 
(MAS) was used to investigate this. The 
Gough raw F minus raw K (F — K) was used 
to measure the defensive nature of the blind 
test-taking attitude. 

The subjects were 34 male and 20 female 
blind. They varied from totally blind to 
“good” travel vision, from blind at birth to 
recent acquisition, from good to poor adjust- 
ment by judges’ rating. The MMPI short form 
(first 366 items plus full K and Si scales) 
was administered verbally. Within these items 
there are 38 of the 50 originally used by Tay- 
lor, and manifest anxiety should be indicated 
if present. The F — K would indicate the va- 
riety of defense if it proves to be beyond “nor- 
mal” expectations. 

The MAS produced a multimodal distribu- 
tion with a mean of 11.09 and a median of 
10.66 (estimated with the full 50 items), sug- 
gesting less than “normal” manifest anxiety. 
The F — K yielded a mean of — 13.19 and a 
median of — 13.50, indicating an attempt to 
“look good” and deny that blindness is in- 
capacitating. The Composite score (Comp: 
number and degree of deviations from normal 


1 An extended report of this study may be ob-
tained without charge from Sidney I. Dean, Mental 
Health Clinic, Florence, South Carolina, or for a fee 
from the American Documentation Institute. Order 
Document No. 5179, remitting $1.25 for microfilm 
or $1.25 for photocopies. 


range) (1) gave a mean of 13.43 and a 
median of 13.70. The Comp and MAS are 
correlated .40, beyond the 1% level, and 
may both be expressions of “maladjustment.” 
F — K and MAS are also correlated beyond 
the 1% level at .57; as anxiety increases so 
does the attempt to look good. F — K and
Comp are not significantly related at .18; as 
adjustment worsens, defensiveness does not 
systematically change. Analyses of variance 
were executed and none produced significant 
Fs. The t tests between sexes indicated a com-
mon population mean, but F tests for vari-
ances all differed at the 1% level. 
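
The phrase "estimated with the full 50 items" suggests that scores on the 38 available MAS items were scaled up to a 50-item equivalent. The report does not say how this was done. Simple proportional prorating is one possibility, sketched below; the function name and the raw score used are hypothetical.

```python
def prorate_mas(raw_score: float, items_given: int = 38, items_full: int = 50) -> float:
    """Scale a score on the administered items up to the full-length scale.

    Assumption: proportional prorating. The source does not state the
    estimation method actually used.
    """
    if items_given <= 0:
        raise ValueError("items_given must be positive")
    return raw_score * items_full / items_given


# Hypothetical subject who endorsed 8 of the 38 available items.
print(round(prorate_mas(8), 2))  # prints: 10.53
```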

Taylor has acquired different medians and 
means with different samples and variations 
of her scale. Results from other investigators 
are more helpful in evaluating the findings in 
this study. A deception index has been sug- 
gested for the F — K to include — 11 or below 
as “faking good,” but more definitive work 
has been done on the positive end of the scale. 
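
The F — K index and the suggested cutoff can be stated compactly. The following is a minimal sketch, assuming the index is simply raw F minus raw K and that "— 11 or below" is read as an inclusive cutoff; the function names and the individual raw scores are hypothetical.

```python
FAKING_GOOD_CUTOFF = -11  # from the text: -11 or below suggested as "faking good"


def f_minus_k(raw_f: int, raw_k: int) -> int:
    """Gough dissimulation index: raw F score minus raw K score."""
    return raw_f - raw_k


def suggests_faking_good(index: float) -> bool:
    """True when the index falls at or below the suggested cutoff."""
    return index <= FAKING_GOOD_CUTOFF


# Hypothetical raw scores for one subject.
index = f_minus_k(raw_f=3, raw_k=17)
print(index, suggests_faking_good(index))  # prints: -14 True

# The group mean and median reported in the text both fall below the cutoff.
print(suggests_faking_good(-13.19), suggests_faking_good(-13.50))  # prints: True True
```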

The blind appear to differ from both nor- 
mal and clinical groups on the MAS. They ap- 
pear to defend themselves by distortion in the 
direction of “looking good.” Anxiety seems re- 
lated to worsening adjustment and greater de- 
fensiveness. Female blind subjects are more 
variable in their responses. Acuity, occur- 
rence, and adjustment were not differentiated 
by analyses of variance. 

Brief Report 
Received November 26, 1956.


Selection of Neuropsychiatric Patients for
Group Psychotherapy¹


Leonard P. Ullmann 
VA Hospital, Palo Alto, California 


Commenting on research in group therapy 
during 1955, Harris remarked: “Selection of 
patients for groups continues to be a ‘wide 
open’ area. For each article insisting that a 
certain type of patient is unsuitable for treat- 
ment, there seems to be a corresponding pub- 
lication reporting a positive group experience 
with just such patients” (6, p. 139). The lit- 
erature of patient selection for group therapy 
draws heavily on therapists’ opinions and ex- 
perience (1, 5) and frequently is expressed in 
diagnostic and sociological signs (4, 15). With 
the possible exception of Kotkov and Mead- 
ow’s (8) finding that a patient is more likely 
to remain in treatment if FC is greater than 
CF on the Rorschach, the usefulness of psy- 
chological tests has been asserted rather than 
demonstrated (10, 12, 13). 

This study presents the relationships of two 
thematic tests to criteria of (a) patient be- 
havior in therapy groups within two weeks of 
testing, and (b) hospital status six months
after testing. In addition to meeting a prac- 
tical need where the number of group thera- 
pists is limited, significant test-criteria rela- 
tionships may generate hypotheses about the 
usefulness of group treatment. 


1 From the Veterans Administration Hospital, Palo 
Alto, California. This report is based in part on a 
dissertation submitted to the Department of Psychol- 
ogy and the Committee on Graduate Study of Stan- 
ford University, May 1955. The writer wishes to ex- 
press his appreciation to Drs. C. L. Winder, Quinn 
McNemar, and Sanford Dean for their generous help 
in this investigation. Material was collected at the 
VA Hospital, Palo Alto; cooperation of the hospital 
staff and Drs. Wesley Becker, Glen Brackbill, Robert 
McFarland and Donald Shannon, who rated test ma- 
terial, is gratefully acknowledged. Part of this paper 
was read at the 1955 convention of the American 
Psychological Association. 


Subjects, Criteria, and Procedures 
Subjects 


As part of an ongoing project,² all patients
receiving group treatment at a Veterans Ad- 
ministration neuropsychiatric hospital were 
routinely rated on the Palo Alto Group Ther- 
apy Scale by their group therapists. The pres- 
ent study examined as many of the patients 
as possible on whom group ratings were made 
during an eight-week period. The sample con- 
sisted of 72 patients, 60 of whom were test- 
able, who were administered the two tests 
described below within two weeks before or 
after being rated on the group therapy scale. 
The patients used represent the entire range 
of adjustment as rated on the Palo Alto 
Group Therapy Scale and came from twelve 
groups with ratings made by ten therapists.


Criteria 


Finney (3) has described the construction 
and validation of a method for rating pa-
tients’ behavior in discussion-type therapy
groups. The Palo Alto Group Therapy Scale 
measures adequacy of interpersonal relation- 
ships as manifested in the group therapy situ- 
ation. The scale is a checklist of 88 items.
The patient’s score is the number of items 
which indicate good interpersonal relation- 
ships. Finney (3) reported that for 18 groups 
in a neuropsychiatric hospital a median rank- 
order correlation of .84 was obtained between 
scores on the scale and global rankings by 
group leaders. In the same study, a rank or- 
der correlation of .80 was obtained between 
the average ratings by ten ward personnel 


2 Thanks are due Dr. Ben Finney who made data 
from his researches available for this study.

of adequacy of interpersonal relationships
throughout the hospital and the patients’ 
scores on the group therapy scale. The Palo 
Alto Group Therapy Scale offers a way of 
quantifying patient behavior in therapy groups 
in terms which are meaningful within the 
framework of group interaction and within 
the larger context of the hospital. 

In the present study, the criterion of hos- 
pital status divides the sample into (a) those 
patients who, six months after the completion 
of testing, were either discharged or on a trial 
visit which had lasted at least 90 days, and 
(b) those patients who did not meet this cri-
terion. Of the 60 testable patients, 26 were in 
the improved hospital status group and 34 
were not. Two of the twelve untestable pa- 
tients were in the improved hospital status 
group. While the criterion of improved hos- 
pital status may overlook important gains 
made by patients who remain in the hospital, 
it is used in this study as a rough, practical 
measure of improvement. 


Test Devices 


Two tests, each administered in 15 to 20 
minutes, were used. The first test consists of 
six TAT cards. Cards 4, 6BM, 7BM, 13MF, 
15, and 17BM were selected after a review of 
work by Eron (2) and Weisskopf (16) sug- 
gested that these pictures are ones beyond a 
minimum level of transcendence on which sub- 
jects tend to produce a greater than average 
number and variety of thema. The TAT pro- 
tocols were scored by a clinician familiar with 
the group therapy scale whose task was to 
predict scores on the group therapy scale from 
the total TAT protocol. To find rater reli- 
ability, two other clinicians ranked the first 
third of the 60 TAT protocols as to predicted 
group therapy scores. The rank-order correla- 
tions between pairs of these three raters were 
.76, .71, and .68 for the 20 cases. 

The second test was a new set of stimuli 
designed for this study called the Social Per- 
ceptions Test (14). The test consists of twelve 
line drawings of people who are faced with 
conflicting socially approved reasons for ac-
tion which are made explicit by their gestures
and speech inserted in cartoon “balloons.” 
The conflicts center around a reason for ac- 
tion which is relatively more self-satisfying as 
opposed to a reason which is more self-re-
strictive. The subject is instructed that he is
to take a test of his knowledge of human na- 
ture, and that after looking at each of the pic- 
tures he will be asked some questions. These 
questions are (a) why should and (b) why 
should not the picture hero do some action 
suggested by the explicit reasons given in the 
situation, and (c) how would the person feel 
if he did and (d) did not do the action. The 
answers to the first two questions, which deal 
with reasons for an action, are scored on a 
scale of recognition of social motivations de- 
scribed in detail in a test manual (14). For 
the first 20 testable cases of the sample, the 
reliability of two raters for the scoring of 
recognition was a product-moment correlation 
of .99. Responses to all four questions are 
used to derive a score of the subject’s ability 
to feel motivations in complex social situa- 
tions. This scale is essentially one of the de- 
gree to which the social motivations have been 
internalized and are felt to be important to 
the welfare of the picture hero. The reliability 
of two raters scoring the first 20 testable cases 
of the sample on the scale of ability to feel 
motivations yielded a product-moment corre- 
lation of .92. For both the recognition and 
feeling scores, the scores of the odd num- 
bered pictures correlated .87 with the scores 
of the even numbered pictures, for the 60 
cases in this sample. Correlations between the 
TAT ratings and the Social Perceptions Test
scores in this study were .60 for recognition 
and .63 for feeling of social motivations. 
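
The odd-even correlation of .87 is a split-half figure. The report does not say whether it was stepped up to full test length. If it is an uncorrected half-test correlation, the Spearman-Brown formula gives the projected reliability of the full twelve-picture test. The sketch below illustrates that calculation under this assumption; the resulting value is not one reported in the study, and the function name is hypothetical.

```python
def spearman_brown(r_half: float, length_factor: float = 2.0) -> float:
    """Project reliability when a test is lengthened by length_factor.

    With length_factor = 2 this is the usual split-half step-up formula.
    """
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


# Assumption: .87 is the uncorrected correlation between odd and even pictures.
projected = spearman_brown(0.87)
print(round(projected, 3))
```

On this reading the full-length reliability would be about .93. If the .87 was already corrected, no further step-up applies.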


Results 


Estimates of the group therapy scale scores 
from TAT protocols by the expert yielded a 
product-moment correlation of .58 for the 60 
testable cases. Ability to recognize motiva- 
tions as measured by the Social Perceptions 
Test correlated .46 with the criterion of group 
therapy scale ratings for the sample of 60 
cases. The score of ability to feel motivations 
as measured by the Social Perceptions Test 
correlated .59 with the criterion for the 60 
testable patients. The correlations of the TAT 
estimates and the feeling score of the Social 
Perceptions Test with the group therapy scale 
ratings are statistically significant beyond the
.0001 level. Of the twelve untestable cases,
three were above the median of the group of 
testable patients on the group therapy scale.
Untestability may be taken as a tentative in- 
dication of poor prognosis of adequate in- 
terpersonal relationships in a group therapy 
setting. 

In this research, hospital status six months 
after the completion of testing was used as a 
rough measure of improvement, and therefore 
as a possible criterion for the selection of 
group members. Biserial correlations were 
computed for the 60 testable patients using 
scores of the projective techniques as the 
graduated variable. The ratings of the TAT 
and hospital status six months later yielded a 
biserial correlation of .41, which for 60 cases 
is significant beyond the .01 level of statisti- 
cal significance. The biserial correlation be- 
tween recognition and hospital status change 
for the 60 cases was .52, significant at the 
.001 level. The biserial correlation between 
hospital status and the ability to feel social 
motivations as measured by the Social Per- 
ceptions Test was .58 and is significant at the 
.0001 level of statistical significance. Since 
ten of the twelve untestable patients were in 
the hospital at the time of follow-up, untest- 
ability may be tentatively considered a poor 
indication for improved hospital status. Add- 
ing these cases would increase the percentage 
of patients correctly identified by both pro- 
jective techniques. Biserial correlations be- 
tween ratings on the group therapy scale and 
hospital status yielded correlations of .31 for 
the 60 testable patients, and .36 for the total 
sample of 72 patients. These relationships are 
between the .05 and .01 levels of statistical 
significance. The results indicate that the tests 
used in this study, although significantly re- 
lated to the criterion of scores on the group 
therapy scale, add information beyond that 
afforded by this measure when change of hos- 
pital status within six months is the criterion. 
Comparing the cases in which the test devices 
and the group therapy scale ratings were in 
disagreement as to indications of good in- 
terpersonal relationships and therefore, sup- 
posedly, of future positive change in hospital 
status, it was found that the tests used were 
more frequently “hitting” correctly the change 
in hospital status criterion than the group 
therapy scale ratings. After Yates’s correction 
had been applied, the greater number of hits 
by the TAT of the discharge criterion as 
compared to the group therapy scale scores,
yielded a chi square of 3.35 significant be-
tween the .10 and .05 level (9, pp. 228-231). 
By the same method, when the group therapy 
scale ratings disagreed with the scores of 
ability to recognize motivations as measured 
by the Social Perceptions Test, the differ- 
ences favored the test score as a predictor of 
changed hospital status by yielding a chi 
square of 4.26 significant beyond the .05 
level. Ability to feel social motivations as 
measured by the Social Perceptions Test cor- 
rectly identified 17 of the 19 cases where there 
was disagreement with the group therapy 
scale ratings as to the criterion of changed 
hospital status. The difference between group 
therapy scale ratings and ability to feel social 
motivations, after Yates’s correction had been 
applied, yielded a chi square of 10.32, signifi- 
cant beyond the .005 level. 
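
The last comparison can be reproduced from the counts given. The chi square for the disagreeing cases appears to follow the test for correlated proportions with Yates's correction, (|b − c| − 1)² / (b + c), where b and c are the numbers of disagreements resolved in favor of each predictor. That reading is inferred from the reported figures and the cited reference; the formula is not stated in the text, and the names below are illustrative.

```python
def corrected_chi_square(b: int, c: int) -> float:
    """Chi square (1 df) for disagreeing pairs, with Yates's correction."""
    if b + c == 0:
        raise ValueError("no disagreeing cases to compare")
    return (abs(b - c) - 1) ** 2 / (b + c)


# From the text: 19 disagreements, 17 resolved in favor of the feeling score.
feeling_hits = 17
scale_hits = 19 - feeling_hits
CHI2_CRIT_005_DF1 = 7.879  # tabled .005 critical value of chi square for 1 df

chi2 = corrected_chi_square(feeling_hits, scale_hits)
print(round(chi2, 2), chi2 > CHI2_CRIT_005_DF1)  # prints: 10.32 True
```

The computed value matches the reported chi square of 10.32, which supports this reading of the method.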


Discussion 


The results obtained in this study may be 
used to draw tentative conclusions as to what 
type of test material is most likely to be of 
value in making predictions to group therapy 
criteria. Thematic test material, which elicits 
responses more similar to the behavior on 
which group criteria are based, gave better 
results in this investigation than did the Ror- 
schach in previous studies cited above. With 
normal subjects, this contrast may also be 
noted. Pepinsky et al. (11) obtained rela-
tively poor results with the Rorschach, while 
Horwitz and Cartwright (7) found many sta- 
tistically significant relationships when a pic- 
ture of a small group meeting was used as 
a projective device. Drawing on the present 
study and TAT techniques such as those of 
Zimet and Fine (17, 18), it seems likely that 
data which deal with the intensity and ap- 
propriateness of interpersonal feelings may be 
a particularly germane source of information 
for the selection of patients for group therapy. 
Such test responses offer a pool of behavior 
from which scores can be derived that will 
cut across diagnostic and sociological cate- 
gories and, through the extrapolation of simi- 
lar behaviors, may be related to criteria of 
readiness for or progress in group therapy. At 
the present time, the use of such psychologi- 
cal measures seems to be the best means of 
resolving the muddled situation described by 
Harris (5).

The present findings lead to a testable for-
mulation of the function of group therapy in 
the treatment of hospitalized neuropsychiatric 
patients. If it is assumed that it is easier for 
the patient to express his feelings in the test 
situation than in reality, and that these feel- 
ings are related to social behavior, then on 
the basis of the present findings it may be 
said that the patient who is likely to respond 
favorably to group therapy is the one who
can express appropriate interpersonal motiva- 
tions and feelings. Because symbolic or the- 
matic test behavior was a better indicator of 
improvement as measured by hospital status 
six months after testing, the group situation 
may possibly be thought of as providing the 
patient with an opportunity to experience, un- 
der protected conditions, those social rewards 
which previously had been expressed symboli- 
cally. The present study is not conclusive on 
this point because other hospital events were 
not considered and a measure of overt behav- 
ior at the time of follow-up was not made. 
These steps, however, are feasible. The pres- 
ent investigation offers the conclusion for fur- 
ther study that level of symbolic interpersonal 
motivation provides a measure of potential to 
change and that the group therapy situation 
is an important method of fostering such a 
change. 


Summary 


The present study was concerned with the 
use of thematic test material in the selection 
of hospitalized neuropsychiatric patients for 
group therapy. Measures derived from six 
cards of the TAT and a new test devised for 
this study were significantly related to the 
criteria of overt behavior in the group situa- 
tion within two weeks of testing and hospital 
status six months after testing. There is strong 
evidence that test behavior is a better indica- 
tion of hospital status six months later than 
ratings of overt behavior in the group situa- 
tion made at the time of testing. These results 
are discussed in terms of the type of test ma- 
terial which is most likely to be useful in the 
selection of patients for group therapy and in 
terms of the possible function of group ther- 
apy in the treatment of hospitalized neuro- 
psychiatric patients.


New Tests


Aliferis, James. Aliferis Music Achievement Test. 
College entrants. 1 form. (40) min. Test booklet, 
pp. 8 ($3.00 per 20); scoring key (50¢); tape re- 
cording ($9.50); manual, pp. 28 ($3.00); speci- 
men set ($3.75). Minneapolis: Univer. of Minne- 
sota Press, 1954. 


Evidence in the manual shows that this test is a 
carefully constructed measure of the auditory-visual 
discrimination of melodic, harmonic, and rhythmic 
elements and idioms of music, for use at the college 
entrance level. The test may be administered with 
the piano, but the use of a tape recording is recom- 
mended for uniformity. In the test booklet, the ex- 
aminee makes multiple-choice responses to indicate 
the musical notation of what he hears. Reliability of 
the whole test is reported as .88, and of three sub- 
scores, .84, .72, and .67. Data from four midwestern 
universities show that the total score correlates .61 
with grades in first-year music courses, in comparison 
to a correlation of .25 with grades in academic 
courses. For individual guidance, the author recom- 
mends that the test be used in combination with a 
test of sensory discriminations such as the Seashore, 
an appraisal of performance on an instrument, and 
a verbal intelligence test.—L. F. S.


Gordon, Leonard V. Gordon Personal Inventory. 
High school, college, adult. 1 form. (10-15) min. 
Question booklet ($2.45 per 35), with key, manual, 
pp. 8; specimen set (20¢). Yonkers, N. Y.: World 
Book Co., 1956. 


The Personal Inventory is presented as a measure 
of four factors of personality: cautiousness (C), 
original thinking (O), personal relations (P), and 
vigor (V). It is a companion instrument to the au- 
thor’s previously published Personal Profile which 
obtains scores on ascendancy (A), responsibility (R), 
emotional stability (E), sociability (S), and over-all
self-evaluation (T) (J. consult. Psychol., 1954, 18,
154). Like the earlier questionnaire, the Inventory is 
characterized by brevity, the use of tetrads of state- 
ments controlled for social desirability, and the 
identification of components by factor analysis. The 
method of construction shows a high degree of tech-
nical competence. The reliabilities of the four new
scores range from .77 to .88 in college and high school
groups. The traits are relatively independent of one
another (— .06 to .47), and of the four traits meas-
ured by the Profile ( .16 to .47). Correlations with
intelligence are mainly insignificant. Norms, rightly
called tentative, are based on about 500 cases from
each of four groups: high school boys and girls and
college men and women, all from the same section
of the United States. Validation is discussed only in
terms of the construct validity inferred from the fac-
tor analysis. Users are cautioned appropriately against
drawing conclusions from small differences in scores.
The two Gordon questionnaires commend themselves
favorably for use when economy of time is essential.
Few other instruments obtain as broad a picture of
self-reported personality in less than 30 minutes.—
L. F. S.


Kuder, G. Frederic. Kuder Preference Record—Occu- 
pational, Form D. High school, college, adult. 1 
form. (20-30) min. IBM or hand scoring. Ques- 
tion booklet, pp. 11 ($9.80 per 20); answer sheet 
($6.25 per 100); occupational keys ($1.00 each);
manual, pp. 12 (50¢); specimen set ($2.00); re- 
search handbook, pp. 47 ($2.50). Chicago: Science 
Research Associates, 1956. 


The newest member of the Kuder family is a blank 
designed to measure the resemblance of the examinee’s 
interests to those of persons in specific occupations.
Like the author’s Personal (Form A) and Vocational
(Form C) preference records (J. consult. Psychol.,
1949, 13, 67), the new inventory was developed with 
great competence, and its manuals communicate in- 
formation of wide scope. The blank consists of 100 
triads of statements which best represent each area 
of the preceding questionnaires and have minimal 
correlations with other areas. In view of relations of 
all 15 of the areas covered by Form A and Form C 
to occupational choice or job satisfaction, this repre- 
sentative pool of items seems likely to provide a suffi- 
ciently broad base for discriminating among occupa- 
tions. Each statement describes an activity; names 
of occupations are avoided in order to promote 
subtlety of discrimination. The vocabulary is at or 
below the sixth grade level. Each occupational key 
was developed by comparing the responses of at 
least 100 persons in the occupation, usually more,
with those of a base group of 1,000 males. For each
key, cross validation on 100 new cases is reported. 
The scores are expressed as “differentiation ratios”: 
the percentage of confidence with which the ex- 
aminee may be classified as a member of the cri- 
terion group. Keys are presently available for six 
occupations and keys for twelve more are to be pub- 
lished in 1957. An ingenious Verification Key suc- 
ceeds well in identifying subjects who have answered 
carelessly or who have sought to give excessively 
favorable impressions of themselves. The Research 
Handbook is the most exceptional adjunct. It reports 
the development of the instrument in detail. Fur- 
ther, it gives explicit instructions for the construction 
of new keys for local purposes, by differentiation of 
a criterion group from the general norm group, by 
differentiation between two criterion groups, and by 
differentiation according to a criterion extending over 
more than two points. Tables and a nomograph 
needed for key construction are provided, and a set 
of worksheets is available. The Handbook is itself a 
useful textbook on the construction of occupational 
inventories and similar empirical questionnaires. The 
merits of the Occupational record commend its use 
in guidance, employment selection, and research. Its 
few faults, mainly stemming from the limited num- 
ber of keys available, are almost certain to be 
remedied.—L. F. S. 


Psychological Test Reviews 


Remmers, H. H., & Bauernfeind, R. H. SRA Junior 
Inventory, Form S (Rev.). Grades 4-8. 1 form. 
(40) min. Question booklet, pp. 8 ($2.00 per 20); 
pupil profile ($1.05 per 20); manual, pp. 32 (50¢); 
specimen set (50¢). Chicago: Science Research As- 
sociates, 1955, 1957. 

Form S of the Junior Inventory is a significant re- 
vision of the Form A issued in 1951 (J. consult. Psy- 
chol., 1952, 16, 160). Instead of responding by mark- 
ing only the statements which represent problems 
for him, a pupil now indicates whether the item 
states “a big problem,” “a middle-sized problem,” 
“a little problem,” or “no problem.” Weighted scor- 
ing is used. The area of health problems has been 
eliminated as a profile category on the basis of 
evidence from an item analysis which showed no 
internal consistency, and the items have been 
reassigned to the areas with which they correlate best. 
The pupil now marks in an expendable booklet, in- 
stead of using a separate answer sheet which gave 
trouble to young examinees. The profile includes 
scores in five areas: school, home, myself, people, 
and general. The revised form is clearly an improve- 
ment. It also continues to show the merits which 
characterized the earlier version—the good scope of 
technical information provided in the manual, and 
the modesty of the suggested interpretations.—L. F. S. 